1. Definition and Goals of Distributed Systems

1.1 Definition 
A distributed system is a collection of independent computers that appear to the users of 
the system as a single computer. 

• Using high-performance computers connected by equally high-speed communication links, it
is possible to build a single system consisting of multiple computers and use it as a single
consolidated system.

• In such a system, multiple resources work together to deliver the required processing speed,
and the operating system takes care of maintaining the system as a whole.

• In a distributed system the computers do not operate in isolation; they are interconnected
by a high-speed network.

• This means that many computers, whether workstations or desktop systems, linked together can
do the work of a high-performance supercomputer.

• For the user to perceive the entire interlinked system as a single, unified computer, there has
to be a software envelope that joins the multiple operating systems together and offers a
seamless interface to the user.
Figure 1.1: Example of a distributed system
1.2 Goals 
Advantages of Distributed system over Centralized system. 
(a) Economics 
▪ A quarter century ago, Grosch's law held that the computing power of a CPU is
proportional to the square of its price.
▪ By paying twice as much, you could get four times the performance.
▪ This observation fit the mainframe technology of its time.
▪ With microprocessor technology, Grosch's law no longer holds.
▪ For a few hundred dollars you can get a CPU chip that can execute more instructions
per second than one of the largest 1980s mainframes.
(b) Speed
▪ A collection of microprocessors can not only give a better price/performance ratio
than a single mainframe, but may also yield an absolute performance that no mainframe
can achieve at any price.
▪ For example, with current technology it is possible to build a system from 10,000
modern CPU chips, each of which runs at 50 MIPS (Millions of Instructions Per
Second), for a total performance of 500,000 MIPS.
▪ For a single processor (i.e., CPU) to achieve this, it would have to execute an
instruction in 0.002 nsec (2 picoseconds).
▪ No existing machine even comes close to this.
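The arithmetic behind the speed example can be checked with a few lines of Python. This is only a worked calculation using the figures quoted above (10,000 CPUs at 50 MIPS each); the variable names are illustrative.

```python
# Figures from the example in the text; nothing here is measured.
CPUS = 10_000
MIPS_PER_CPU = 50

# Aggregate throughput of the whole collection of microprocessors.
total_mips = CPUS * MIPS_PER_CPU

# Time budget per instruction if ONE processor had to match that rate.
instructions_per_second = total_mips * 1_000_000
seconds_per_instruction = 1 / instructions_per_second
picoseconds_per_instruction = seconds_per_instruction * 1e12

print(f"total: {total_mips} MIPS")
print(f"single CPU would need about {picoseconds_per_instruction:.1f} ps per instruction")
```

The result reproduces the 500,000 MIPS total and the 2 picosecond instruction time stated in the text.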

(c) Inherent Distribution
▪ Many institutions have applications that are inherently suited to distributed
computation.
▪ Assume that a large company buys and sells goods in different countries.
▪ Its offices, situated in those countries, are geographically dispersed.
▪ If the company wants a unified computing system, it should implement a distributed
computing system.
▪ Consider the global employee database of such a multinational company.
▪ Branch offices would create local databases and then link them for global viewing.

(d) Reliability
▪ By distributing the workload over many machines, a single chip failure will bring down
at most one machine, leaving the rest intact.
▪ Ideally, if 5 percent of the machines are down at any moment, the system should be
able to continue to work with a 5 percent loss in performance.
▪ For critical applications, such as control of nuclear reactors or aircraft, using a
distributed system to achieve high reliability may be the dominant consideration.

(e) Incremental Growth
▪ A company will often buy a mainframe with the intention of doing all its work on it.
▪ If the company expands and the workload grows, at a certain point the mainframe will
no longer be adequate.
▪ The only solutions are either to replace the mainframe with a larger one (if it exists)
or to add a second mainframe.
▪ Both of these can seriously disrupt the company's operations.
▪ In contrast, with a distributed system, it may be possible simply to add more
processors to the system, thus allowing it to expand gradually as the need arises.


Item                    Description
Economics               Microprocessors offer a better price/performance ratio than mainframes.
Speed                   A distributed system may have more total computing power than a mainframe.
Inherent distribution   Some applications involve spatially separated machines.
Reliability             If one machine crashes, the system as a whole can still survive.
Incremental growth      Computing power can be added in small increments.

Table 1.1: Advantages of distributed systems over centralized systems.
Advantages of Distributed System over Independent PCs
(a) Data sharing
▪ Many users need to share data.
▪ For example, airline reservation clerks need access to the master database of flights
and existing reservations.
▪ Giving each clerk his own private copy of the entire database would not work, since
nobody would know which seats the other clerks had already sold.
▪ Shared data are absolutely essential for this and many other applications, so the
machines must be interconnected.
(b) Resource sharing
▪ Expensive peripherals, such as color laser printers, phototypesetters, and massive
archival storage devices (e.g., optical jukeboxes), can also be shared.
(c) Communication
▪ For many people, electronic mail has numerous attractions over paper mail,
telephone, and FAX.
▪ It is much faster than paper mail, does not require both parties to be available at the
same time as the telephone does, and, unlike FAX, produces documents that can be
edited, rearranged, stored in the computer, and manipulated with text processing
programs.
(d) Flexibility
▪ A distributed system is more flexible than giving each user an isolated personal
computer.
▪ One way is to give each person a personal computer and connect them all with a LAN,
but this is not the only possibility.
▪ Another is to have a mixture of personal and shared computers, perhaps of
different sizes, and let jobs run on the most appropriate one, rather than always on
the owner's machine.
▪ In this way, the workload can be spread over the computers more effectively, and the
loss of a few machines may be compensated for by letting people run their jobs
elsewhere.
Item             Description
Data sharing     Allow many users access to a common database.
Device sharing   Allow many users to share expensive peripherals like color printers.
Communication    Make human-to-human communication easier, for example, by electronic mail.
Flexibility      Spread the workload over the available machines in the most cost-effective way.

Table 1.2: Advantages of Distributed System over PCs.
2. Hardware Concepts

[Figure: classification tree. Parallel and distributed computers are either tightly coupled
(shared memory; examples: Sequent, Encore, Ultracomputer, RP3) or loosely coupled (private
memory; examples: workstations on a LAN, Hypercube, Transputer).]
Figure 1.2: Classification of distributed systems.


• Even though all distributed systems consist of multiple CPUs, there are several different ways
the hardware can be organized, especially in terms of how the CPUs are interconnected and how
they communicate.

• Various classification schemes for multiple-CPU computer systems have been proposed over
the years.

• According to Flynn, classification can be based on the number of instruction streams
and the number of data streams.

• SISD
o A computer with a single instruction stream and a single data stream is called SISD.
o All traditional uniprocessor computers (i.e., those having only one CPU) fall in this
category, from personal computers to large mainframes.

• SIMD
o The next category is SIMD, single instruction stream, multiple data stream.
o This type refers to array processors with one instruction unit that fetches an instruction
and then commands many data units to carry it out in parallel, each with its own data.
o These machines are useful for computations that repeat the same calculation on many
sets of data, for example, adding up all the elements of 64 independent vectors.
o Some supercomputers are SIMD.

• MISD
o The next category is MISD, multiple instruction stream, single data stream. No known
computers fit this model.

• MIMD
o Finally comes MIMD, which essentially means a group of independent computers, each
with its own program counter, program, and data.
o All distributed systems are MIMD.
o We divide all MIMD computers into two groups: those that have shared memory, usually
called multiprocessors, and those that do not, sometimes called multicomputers.


2.1 Tightly Coupled System

• Tightly coupled systems are used as parallel systems.

• In such a system, there is a single system-wide primary memory that is shared by all processors.

(a) Bus Based Multiprocessor
[Figure: CPUs and a memory module attached to a common bus.]
Figure 1.3: A bus based multiprocessor.

o A bus-based multiprocessor consists of a number of CPUs all connected to a common bus,
along with a memory module.

o Words are read from and written to the common memory module.

o Since there is only one memory, if CPU A writes a word to memory and CPU B reads
that word back a microsecond later, B will get the value just written.

o The drawback is that all memory traffic crosses the single bus, so with more than a few
CPUs the bus is overloaded and performance drops drastically.

o The solution is to add a high-speed cache memory between each CPU and the bus.

o The cache holds the most recently accessed words; if the requested word is in the cache,
the cache itself responds to the CPU, otherwise the word is fetched from memory.

o This creates a new problem: if two CPUs read the same word into their respective caches and
the first CPU then overwrites the word, a subsequent read by the second CPU returns the
old value from its own cache.

o One solution is a write-through cache: whenever a word is written to the cache, it is also
written to memory.

o In addition, every cache monitors the bus; when a cache sees a write to a word it holds, it
either removes that entry or updates it with the new value.

o Such a cache is called a snoopy cache.

o A design consisting of snoopy write-through caches is coherent and is invisible to the
programmer.
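The write-through and snooping rules described above can be sketched as a small simulation. This is a minimal illustration, not a model of any real machine: the class names `Bus` and `SnoopyCache`, the invalidate-on-write policy, and the address `0x10` are assumptions chosen for the example.

```python
class Bus:
    """Shared bus: the single memory plus every cache that snoops on it."""

    def __init__(self):
        self.memory = {}
        self.caches = []

    def write(self, writer, addr, value):
        # Write-through: memory is updated on every write.
        self.memory[addr] = value
        # Every other cache sees the write on the bus and reacts to it.
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_write(addr)

    def read(self, addr):
        return self.memory.get(addr, 0)


class SnoopyCache:
    """Per-CPU cache that invalidates an entry when it snoops a write to it."""

    def __init__(self, bus):
        self.bus = bus
        self.lines = {}
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:          # miss: fetch the word over the bus
            self.lines[addr] = self.bus.read(addr)
        return self.lines[addr]             # hit: no bus traffic needed

    def write(self, addr, value):
        self.lines[addr] = value
        self.bus.write(self, addr, value)

    def snoop_write(self, addr):
        self.lines.pop(addr, None)          # drop the stale copy, if any


bus = Bus()
cpu_a, cpu_b = SnoopyCache(bus), SnoopyCache(bus)
bus.memory[0x10] = 1
cpu_a.read(0x10)
cpu_b.read(0x10)          # both caches now hold the word
cpu_a.write(0x10, 2)      # B's copy is invalidated by snooping
```

After the write by CPU A, the next read by CPU B misses in its cache and fetches the new value from memory, which is the coherence property the text describes. Updating the entry in place instead of removing it would be the other policy mentioned above.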

(b) Switched Multiprocessors

o To build a multiprocessor with more than 64 processors, a different method is needed to
connect the CPUs with the memory.

o One possibility is to divide the memory into modules and connect them to the CPUs
with a crossbar switch, as shown in Figure 1.4.

o Each CPU and each memory has a connection coming out of it, as shown.

o At every intersection is a tiny electronic crosspoint switch that can be opened and closed
in hardware.

o When a CPU wants to access a particular memory, the crosspoint switch connecting them
is closed momentarily to allow the access to take place.
[Figure: a crossbar switch connecting CPUs to memories through crosspoint switches, and an
omega network built from 2x2 switches.]
Figure 1.4: Types of switched multiprocessor.
o The downside of the crossbar switch is that with n CPUs and n memories, n² crosspoint
switches are needed.
o For large n, this number can be prohibitive.
o For this reason the omega switching network was developed.
o The omega network in the figure contains four 2x2 switches, each having two inputs and two
outputs.
o Each switch can route either input to either output.
o Every CPU can access every memory.
o With n CPUs and n memories, the omega network requires log₂ n switching stages,
each containing n/2 switches, for a total of (n log₂ n)/2 switches.
o For large n this is much better than n², but it is still substantial.
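The two switch-count formulas can be compared directly. The sketch below simply evaluates n² and (n log₂ n)/2; the function names are illustrative, and the omega formula is assumed to apply only when n is a power of two.

```python
def crossbar_switches(n: int) -> int:
    """Crosspoint switches needed for n CPUs and n memories: n squared."""
    return n * n


def omega_switches(n: int) -> int:
    """2x2 switches in an omega network: log2(n) stages of n/2 switches each."""
    stages = n.bit_length() - 1
    if n < 2 or 2 ** stages != n:
        raise ValueError("n is assumed to be a power of two")
    return stages * (n // 2)


for n in (4, 64, 1024):
    print(n, crossbar_switches(n), omega_switches(n))
```

For n = 4 the omega count is four switches, matching the network described above; for n = 1024 the crossbar needs over a million crosspoints while the omega network needs a few thousand switches, which is why the text calls the latter much better yet still substantial.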




2.2 Loosely Coupled System
(a) Bus Based Multicomputer
o Each CPU has a direct connection to its own local memory.
o The topology looks similar to that of the bus-based multiprocessor, but the interconnect
carries only CPU-to-CPU messages rather than CPU-to-memory traffic, so there is much less
traffic over it and no need for a high-speed backplane bus.

[Figure: workstations connected by a network.]
Figure 1.5: Bus Based Multicomputer
o Thus a bus-based multicomputer is typically a collection of workstations on a LAN rather
than a collection of CPU cards inserted into a fast bus.
(b) Switched Multicomputers
o Figure 1.6 shows two popular topologies: (a) a grid and (b) a hypercube.
[Figure: (a) grid; (b) hypercube.]
Figure 1.6: Switched Multicomputer.


o Grids are easy to understand and lay out on printed circuit boards.

o They are best suited to problems that have an inherent two-dimensional nature, such as
graph theory or vision (e.g., robot eyes or analyzing photographs).

o A hypercube is an n-dimensional cube.

o The hypercube of Figure 1.6(b) is four-dimensional.

o It can be thought of as two ordinary cubes, each with 8 vertices and 12 edges.

o Each vertex is a CPU.

o Each edge is a connection between two CPUs.

o The corresponding vertices in each of the two cubes are connected.

o Since only nearest neighbors are connected, many messages have to make several hops
to reach their destination.
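The hop counts in a hypercube can be made concrete with a short sketch. It assumes the usual convention, not stated in the text, that each CPU is labelled with an n-bit number and that two CPUs are neighbors when their labels differ in exactly one bit; the function names are illustrative.

```python
def neighbors(node: int, dims: int) -> list[int]:
    """CPUs directly connected to `node`: flip each of the `dims` address bits."""
    return [node ^ (1 << bit) for bit in range(dims)]


def hops(src: int, dst: int) -> int:
    """Minimum number of hops: the count of address bits in which src and dst differ."""
    return bin(src ^ dst).count("1")


def route(src: int, dst: int, dims: int) -> list[int]:
    """One shortest path, correcting the differing bits from lowest to highest."""
    path, cur = [src], src
    for bit in range(dims):
        if (cur ^ dst) & (1 << bit):
            cur ^= 1 << bit
            path.append(cur)
    return path


DIMS = 4  # the four-dimensional hypercube discussed above
edge_count = sum(len(neighbors(v, DIMS)) for v in range(2 ** DIMS)) // 2
print(edge_count, hops(0b0000, 0b1111), route(0b0000, 0b1011, DIMS))
```

Under this labelling the four-dimensional cube has 32 links, which agrees with the description above: two cubes of 12 edges each plus 8 links joining corresponding vertices. A message between opposite corners needs four hops.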




3. Software Concepts 

• The image that a system presents to its users, and how they think about the system, is largely
determined by the operating system software, not the hardware.

• There are basically two types of operating system for multiprocessors and multicomputers:
(1) tightly coupled operating systems and (2) loosely coupled operating systems.

• Loosely coupled software allows machines and users of a distributed system to be
fundamentally independent of one another.

• Consider a group of personal computers, each of which has its own CPU, its own memory, its
own hard disk, and its own operating system, but which share some resources, such as laser
printers and databases, over a LAN.

• This system is loosely coupled, since the individual machines are clearly distinguishable, each
with its own job to do.

• If the network should go down for some reason, the individual machines can still continue to
run to a considerable degree, although some functionality may be lost.

• As an example of a tightly coupled system, consider a multiprocessor dedicated to running a
single chess program in parallel.

• Each CPU is assigned a board to evaluate, and it spends its time examining that board and all
the boards that can be generated from it.

• When the evaluation is finished, the CPU reports back the results and is given a new board to
work on.

• The software for this system, both the application program and the operating system
required to support it, is clearly much more tightly coupled than in the previous example.
3.1 Tightly Coupled System
(a) Distributed Operating System
o A distributed operating system is tightly coupled software on loosely coupled hardware.
o The goal of such a system is to create the illusion in the minds of the users that the entire
network of computers is a single timesharing system, rather than a collection of distinct
machines.
o Characteristics:
▪ There is a single, global interprocess communication mechanism so that any process
can talk to any other process.
▪ There is a global protection scheme.
▪ Process management is also uniform everywhere.
▪ How processes are created, destroyed, started, and stopped does not vary from
machine to machine.
▪ The file system looks the same from everywhere.
▪ Every file is visible at every location, subject to protection and security constraints.
▪ Each kernel has considerable control over its own local resources.
▪ For example, if swapping or paging is used, the kernel on each CPU is the logical place
to determine what to swap or page.

(b) Multiprocessor Timesharing System

o A multiprocessor timesharing system is tightly coupled software on tightly coupled
hardware.

o While various special-purpose machines exist in this category, the most common
general-purpose examples are multiprocessors that are operated as a UNIX timesharing system,
but with multiple CPUs instead of one CPU.

o The key characteristic of this class of system is the existence of a single run queue: a list
of all the processes in the system that are logically unblocked and ready to run.

o The run queue is a data structure kept in the shared memory.


[Figure: CPUs sharing one memory. The memory holds processes A, B and C (running), processes
D and E (ready), the run queue (D, E) and the operating system.]
Figure 1.7: Multiprocessor Timesharing System

o As an example, consider the system of Figure 1.7, which has three CPUs and five processes
that are ready to run.

o All five processes are located in the shared memory, and three of them are currently
executing: process A on CPU 1, process B on CPU 2, and process C on CPU 3.

o The other two processes, D and E, are also in memory, waiting their turn.

o Now suppose that process B blocks waiting for I/O or its quantum runs out.

o In both cases CPU 2 must suspend it and find another process to run.


Nirali C Dutiya, CE Department | 2160710 — Distributed Operating System | 8 


Darshan Institute of Engineering & Technology | Unit 1 - Introduction to Distributed Systems


o CPU 2 will normally begin executing operating system code located in the shared memory.

o After having saved all of B's registers, it will enter a critical region to run the scheduler to
look for another process to run.

o It is essential that the scheduler be run as a critical region to prevent two CPUs from
choosing the same process to run next.

o The necessary mutual exclusion can be achieved by using monitors, semaphores, or any
other standard construction used in single-processor systems.

o Once CPU 2 has gained exclusive access to the run queue, it can remove the first
entry, D, exit from the critical region, and begin executing D.

o Initially, execution will be slow, since CPU 2's cache is full of words belonging to that part
of the shared memory containing process B, but after a little while, these will have been
purged and the cache will be full of D's code and data, so execution will speed up.

o Because none of the CPUs have local memory and all programs are stored in the global
shared memory, it does not matter on which CPU a process runs.

o If a long-running process is scheduled many times before it completes, on average it
will spend about the same amount of time running on each CPU.

o If all CPUs are idle, waiting for I/O, and one process becomes ready, it is slightly preferable
to allocate it to the CPU it was last using, assuming that no other process has used that
CPU since, because that CPU's cache may still hold some of the process's words.
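The need to run the scheduler as a critical region can be illustrated with a small sketch. It is an assumption-laden model, not operating system code: Python threads stand in for CPUs, a `threading.Lock` stands in for the semaphore or monitor, and the names `pick_next_process` and `cpu` are invented for the example.

```python
import threading
from collections import deque

run_queue = deque(["D", "E"])        # ready processes, as in Figure 1.7
run_queue_lock = threading.Lock()    # guards the scheduler's critical region


def pick_next_process():
    """Scheduler: remove and return the first ready process, or None if none."""
    with run_queue_lock:             # only one CPU at a time may be in here
        return run_queue.popleft() if run_queue else None


chosen = []                          # (cpu_id, process) pairs, for inspection
chosen_lock = threading.Lock()


def cpu(cpu_id):
    """A CPU whose current process just blocked: ask the scheduler for work."""
    proc = pick_next_process()
    with chosen_lock:
        chosen.append((cpu_id, proc))


threads = [threading.Thread(target=cpu, args=(i,)) for i in (1, 2, 3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(chosen))
```

Because the queue is only touched while the lock is held, D and E are each handed to exactly one CPU, and the third CPU finds the queue empty. Which CPU gets which process depends on thread timing, so the printed assignment can vary between runs.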


3.2 Loosely Coupled System
(a) Network Operating System
o A typical example of loosely coupled software on loosely coupled hardware is a network
of workstations connected by a LAN.
o In this model, each user has a workstation for his exclusive use.
o It may or may not have a hard disk.
o It definitely has its own operating system.
o All commands are normally run locally, right on the workstation.
o Sharing can be obtained through a few commands, as below:
(i) rlogin machine2
▪ This command allows you to log into another computer (machine2) and work as if
it were your home computer.
▪ You can use your computer as a remote terminal and run your program on the other
system, using its superior processing power and better software resources.
(ii) rcp machine1:file1 machine2:file2
▪ This is a remote copy command to copy files from one system to another.
▪ file1 on machine1 is copied to machine2 and named file2.
▪ The user has to be aware of the location of the files and give specific commands.
o As shown in Figure 1.8, in a NOS there is a single file system implemented on one or more
systems (the file servers), having multiple disks or using a RAID technique.
o Files and disk drives are maintained centrally, so users can keep their own
systems small and minimal.


o Users can load their programs from any workstation connected to the server and use the
files stored at the file server.

o The file server accepts requests for files from programs running on any workstation,
processes each request, and transfers either the entire file or the relevant block to the
requesting program.


[Figure: clients (client1, client2) send requests to a file server and receive replies; the
shared files are stored on the file server's disks.]
Figure 1.8: Network Operating System.


3.3 Difference between Network Operating System, Distributed Operating System and
Multiprocessor Timesharing System

Item                                                Network OS    Distributed OS  Multiprocessor OS
Does it look like a virtual uniprocessor?           No            Yes             Yes
Do all have to run the same operating system?       No            Yes             Yes
How many copies of the operating system are there?  N             N               1
How is communication achieved?                      Shared files  Messages        Shared memory
Are agreed-upon network protocols required?         Yes           Yes             No
Is there a single run queue?                        No            No              Yes
Does file sharing have well-defined semantics?      Usually no    Yes             Yes

Table 1.3: Comparison of three different ways of organizing n CPUs.


4. Design Issues in Distributed Operating System
• The issues in designing a distributed system include:
4.1 Transparency
• A distributed system that is able to present itself to users and applications as if it were only a
single computer system is said to be transparent.
• There are eight types of transparency in a distributed system.
(a) Access Transparency:
o It hides differences in data representation and in how a resource is accessed by a user.
o Example: a distributed system may contain computers that run different operating
systems, each having its own file naming conventions.
o Differences in naming conventions, as well as in how files can be manipulated, should be
hidden from users and applications.
(b) Location Transparency:
o It hides where exactly the resource is physically located.
o Example: when resources are assigned logical names like yahoo.com, one cannot tell
where the web page's main server is located.
(c) Migration Transparency:
o Distributed systems in which resources can be moved without affecting how the resources
can be accessed are said to provide migration transparency.
o It hides the fact that the resource may move from one location to another.
(d) Relocation Transparency:
o This transparency deals with the fact that a resource can be relocated while it is being
accessed, without the user of the application noticing anything.
o Example: using a Wi-Fi system on a laptop while moving from place to place without
getting disconnected.
(e) Replication Transparency:
o It hides the fact that multiple copies of a resource could exist simultaneously.
o To hide replication, it is essential that the replicas have the same name.
o Consequently, a system that supports replication should also support location
transparency.
(f) Concurrency Transparency:
o It hides the fact that the resource may be shared by several competing users.
o Example: two independent users may each have stored their files on the same server, or
may be accessing the same table in a shared database.
o In such cases, it is important that each user does not notice that the others are making
use of the same resource.
(g) Failure Transparency:
o It hides the failure and recovery of resources.
o Masking failures is hard because a user cannot distinguish a very slow resource from a
dead one.
o Example: the same error message appears when a server is down, when the network is
overloaded, or when the connection on the client side is lost.
o The user therefore cannot tell what to do: wait for the network to clear up, or try again
later when the server is working again.
(h) Persistence Transparency:
o It hides whether a resource is in memory or on disk.
o Example: object-oriented databases provide facilities for directly invoking methods on
stored objects.
o The database server first copies the object's state from disk to main memory,
performs the operation, and then writes the state back to disk.




o The user does not know that the server is moving the state between primary and secondary
memory.


4.2 Reliability
• Reliability in terms of data means that data should be available without any errors.
• Because a distributed system has multiple processors available, it can be made more reliable
than a single machine.
• On a failure, a backup copy is available.
• In case of replication, all copies should be kept consistent.


4.3 Scalability

• When a system must support more users or resources, centralized services, tables,
and algorithms become limitations.

• Centralized services are those implemented by a single server running on a
specific machine in the distributed system.

• As the number of users grows, the server will face manageability issues.

• Keeping data centralized in a single database helps security but may lead to communication
failures and a bottleneck.

• Query handling may also be very slow due to widely spread data.

• Centralized algorithms are sometimes used to route messages in the network.

• Information about the load on the machines is gathered in one place, and a graph theory
algorithm is used to calculate the optimal route.

• This scheme has a problem of overload, which can be avoided by decentralization.

• Three scaling techniques can be used to overcome the above problems:

• Hide communication latencies


[Figure: a form with the fields FIRST NAME, LAST NAME and E-MAIL being filled in on the client.
In (a) the form is checked and processed on the server; in (b) the form is checked on the
client and only processed on the server.]
Figure 1.9: Hide communication latencies


o The client avoids waiting for responses to remote service requests and does some
other useful work in the meantime.

o When the reply arrives, the application is interrupted and the reply is
accepted.

o Figure 1.9 illustrates the process of accessing a database using forms.

o In the simple approach, each field the user fills in is sent to the server, and the user
waits for the acknowledgement.

o The server checks each entry for syntactic errors before accepting it.

o To improve performance, a better solution is to fill in the form on the user's machine,
check the entries there, and then send the completed form to the server.

o On the Web, this approach is supported by code that runs in the browser, such as Java applets.
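The difference between checking a form at the server and checking it at the client can be sketched by counting messages. The sketch below rests on stated assumptions: server-side checking is modelled as one message per field, the validation rules (non-empty names, a simple e-mail pattern) are invented for illustration, and the field values are the ones visible in Figure 1.9.

```python
import re

FIELDS = ("first_name", "last_name", "email")


def check_field(name: str, value: str) -> bool:
    """Hypothetical syntactic check: names non-empty, e-mail of the form x@y.z."""
    if name == "email":
        return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value) is not None
    return bool(value.strip())


def server_side_checking(form: dict) -> tuple[int, bool]:
    """Approach (a): every field travels to the server to be checked."""
    messages = 0
    for name in FIELDS:
        messages += 1                       # one request (and wait) per field
        if not check_field(name, form[name]):
            return messages, False
    return messages, True


def client_side_checking(form: dict) -> tuple[int, bool]:
    """Approach (b): check locally, then send the completed form once."""
    if not all(check_field(name, form[name]) for name in FIELDS):
        return 0, False                     # error caught without any message
    return 1, True


form = {"first_name": "MAARTEN", "last_name": "VAN STEEN", "email": "STEEN@CS.VU.NL"}
print(server_side_checking(form), client_side_checking(form))
```

With a valid form, approach (a) costs one network round trip per field while approach (b) costs a single message, and an invalid entry in approach (b) is rejected without contacting the server at all.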

• Hide distribution

o This technique involves splitting a component into several parts and distributing
them across the network.

[Figure: DNS domain tree, including a group of country domains.]
Figure 1.10: Internet DNS
o As shown in Figure 1.10, the best example is the Internet DNS.
o Its name space is a domain tree divided into distinct zones.
o An independent name server handles each zone.
o Each path name in the tree can be thought of as the name of a host on the Internet.
o To resolve a name means to find the network address of the specific host.
o The naming service is distributed across several machines; a single server does not
have to deal with all the requests for name resolution.
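Zone-by-zone name resolution can be sketched with a toy name space. Everything in the data below is hypothetical: the server names (`root`, `ns-nl`, `ns-vu`), the host names, and the addresses (taken from the 192.0.2.0/24 documentation range) are invented, and real DNS involves caching, record types and a wire protocol that this sketch ignores.

```python
# Hypothetical zone data: each name server knows only its own zone.
SERVERS = {
    "root":  {"delegations": {"nl": "ns-nl"}, "hosts": {}},
    "ns-nl": {"delegations": {"vu.nl": "ns-vu"}, "hosts": {}},
    "ns-vu": {"delegations": {},
              "hosts": {"cs.vu.nl": "192.0.2.10", "www.vu.nl": "192.0.2.20"}},
}


def resolve(name: str) -> tuple[str, list[str]]:
    """Return (address, servers consulted), following delegations from the root."""
    server = "root"
    consulted = [server]
    while True:
        zone = SERVERS[server]
        if name in zone["hosts"]:
            return zone["hosts"][name], consulted
        # Find a child zone whose domain is a suffix of the requested name.
        next_server = None
        for suffix, child in zone["delegations"].items():
            if name == suffix or name.endswith("." + suffix):
                next_server = child
                break
        if next_server is None:
            raise KeyError(f"cannot resolve {name}")
        server = next_server
        consulted.append(server)


print(resolve("cs.vu.nl"))
```

Each server answers only for its own zone and hands the rest of the work to the server of a child zone, so no single machine has to hold the whole name space or handle every request.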
• Hide replication
o Scaling a system up often leads to degradation in performance.
o A good solution is to replicate components across the distributed system.
o This increases availability and assists in load balancing.
o Caching is another solution.




4.4 Flexibility

• Flexibility depends largely on the decision whether to use a monolithic kernel or a
microkernel on each machine.

• The kernel is the central control, which provides the basic system facilities.

• The major functions of the kernel are memory management, process management and
resource management.

• Compared with the monolithic kernel model, the microkernel model has several advantages.

• In the monolithic kernel model, the large size of the kernel reduces the overall flexibility and
configurability of the resulting operating system.

• On the other hand, the resulting operating system of the microkernel model is highly modular
in nature.

• Due to this characteristic, the operating system of the microkernel model is easy to
design, implement, and install.

• Moreover, since most of the services are implemented as user-level server processes, it is
also easy to modify the design or add new services.


4.5 Performance

• The aim of performance transparency is to allow the system to be automatically reconfigured
to improve performance as the load varies dynamically.

• Message transmission over a LAN takes some time, about one millisecond.

• To optimize performance, reduce the number of messages transmitted.

• One solution is to batch messages together.

• Another is to cache data that is needed repeatedly.

• A very small computation, such as the addition of two numbers, is not worth computing
remotely.

• Large computations with a low interaction rate and little data transfer are ideal candidates
for remote execution.
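A rough cost model shows why batching helps and why tiny computations should stay local. This is only a sketch under simplifying assumptions: each message is charged the one millisecond quoted above regardless of its size, a remote call is charged one request and one reply, and the function names are illustrative.

```python
import math


def total_latency_ms(n_items: int, batch_size: int, per_message_ms: float = 1.0) -> float:
    """Messaging time to ship n_items when batch_size items share one message."""
    messages = math.ceil(n_items / batch_size)
    return messages * per_message_ms


def worth_running_remotely(local_ms: float, remote_ms: float,
                           round_trip_ms: float = 2.0) -> bool:
    """Remote execution pays off only if compute saved exceeds messaging cost."""
    return remote_ms + round_trip_ms < local_ms


print(total_latency_ms(1000, 1), total_latency_ms(1000, 100))
print(worth_running_remotely(0.001, 0.001), worth_running_remotely(5000, 1000))
```

Under these assumptions, sending 1000 items one per message costs about a second of messaging time, while batches of 100 cost about ten milliseconds. Adding two numbers remotely loses because the round trip dwarfs the computation, whereas a computation lasting seconds on the local machine can gain from a faster remote one.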


4.6 Security
• Security consists of three main aspects, namely:
o Confidentiality: protection against unauthorized access.
o Integrity: protection of data against corruption.
o Availability: protection against failure, so that the system is always accessible.
1. Computer Network and Layered Protocols


1.1 Computer Network

• A computer network is a communication system that links end systems by communication
lines and software protocols to exchange data between two processes running on different
end systems of the network.

• The end systems are often referred to as nodes, sites, hosts, computers, machines, and so on.

• Size-wise, a node may be a small microprocessor, a workstation, a minicomputer, or a large
supercomputer.

• Function-wise, a node may be a dedicated system without any capability for interactive users,
a single-user personal computer, or a general-purpose timesharing system.

• A distributed system is basically a computer network whose nodes have their own local
memory and may also have other hardware and software resources.


1.2 Classification of Networks
• The following factors are used to classify networks.

o The number of nodes: a network may be small enough to fit into a room or as large as the
Internet, spanning the entire globe.

o The types of communication links include twisted pair, coaxial cable, and fiber optic
cable for short distances, and leased services like telephone lines or satellite channels
for long distances.

o The rate of data transmission depends on the number of nodes and the type of
communication link.

o Communication costs include data transfer, administrative, and maintenance costs.

o The larger the network, the higher the communication cost.

• Local area network

o LANs span a relatively small area like a building or a group of buildings.

o They operate at high data transmission speeds.

o These networks comprise workstations and personal computers connected with twisted
pair, coaxial cable or optical fibre.

o Each node in a LAN has its own CPU to execute programs.

o A node is able to access and share data and resources across the LAN.

• Wide area network

o These networks extend across a relatively large geographical area like cities or countries.

o A WAN consists of two or more LANs connected using a mixture of transmission media
like coaxial cable, fiber optic cable, and leased lines or satellites.

o Hence these networks use routers and other devices for traffic management.

o These networks operate at lower speeds than LANs, with a latency range of 0.1-0.5 sec
and a bandwidth range of 1-2 Mbps.

• Metropolitan area network

o MANs use copper or fiber optic cabling to connect computers within a 50 km area for
transmission of audio, video and voice information.

o Ethernet or ATM technology is used to route the data across the network.
o MANs use DSL (Digital Subscriber Line) with ATM switches for a 1.5 km area and cable
modems beyond that distance.
o These networks operate at a speed of 0.25-6.0 Mbps.




Figure2.1: Interconnection of LAN, MAN and WAN. 
e Wireless LAN 
o WLAN technology allows computer network devices to have network connectivity 
without the use of wires, using radio waves. 
o These networks use infrared link, Bluetooth or low power radio network operating at a 
bandwidth of 1-2 Mbps over a distance of 10m. 


Wireless networks
   Wireless LAN
      Personal area networks
         Example 1: Bluetooth, 1 Mbps, 10 meters
         Other examples: wireless sensor network, UWB
      Business LANs
         Example 1: 802.11b, 11 Mbps, 100 meters
         Other examples: 802.11g, HiperLAN2
   Wireless MAN
      Wireless local loop (fixed wireless)
         Example 1: LMDS, 37 Mbps, 2-4 km
         Example 2: FSO, 1.25 Gbps, 1-2 km
   Wireless WAN
      Cellular networks
         Example 1: GSM, 9.6 kbps, wide coverage
         Example 2: 3G, 2 Mbps, wide coverage
      Satellite systems
         Example 1: Motorola Iridium, up to 64 Mbps, globally
         Example 2: Deep space communication
      Paging networks
         Example 1: FLEX, 1.2 kbps
         Example 2: ReFLEX, 6.4 kbps

Figure2.2: Classification of wireless networks 
o At the top level, wireless networks can be classified as wireless LAN, wireless MAN, and 
wireless WAN. 
o The wireless LANs are further classified as personal area networks (e.g. Bluetooth,
wireless sensor networks) and business LANs (e.g. 802.11b).
o Wireless MANs are wireless local loops. 
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o The wireless WANs are classified as cellular networks (e.g GSM), satellite systems and 


paging networks. 


1.3. Layered Protocols -OSI Model 


Machine 1                                            Machine 2

Process A                                            Process B

Application   <------ Application protocol ------>   Application
Presentation  <------ Presentation protocol ----->   Presentation
Session       <------ Session protocol ---------->   Session
Transport     <------ Transport protocol -------->   Transport
Network       <------ Network protocol ---------->   Network
Data Link     <------ Data Link protocol -------->   Data Link
Physical      <------ Physical protocol --------->   Physical



Figure2.3: OSI Model 

e In the OSI model, communication is divided up into seven levels or layers, as shown in 
Figure2.3. 

e Each layer deals with one specific aspect of the communication. 

e Each layer provides an interface to the one above it. 

e The interface consists of a set of operations that together define the service the layer offers.

e In the OSI model, when process A on machine 1 wants to communicate with process B on
machine 2, it builds a message and passes the message to the application layer on its
machine.

e This layer might be a library procedure, for example, but it could also be implemented in 
some other way. 

e The application layer software then adds a header to the front of the message and passes 
the resulting message across the layer 6/7 interface to the presentation layer. 

e The presentation layer in turn adds its own header and passes the result down to the session 
layer, and so on. 
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e Some layers add not only a header to the front, but also a trailer to the end. 
e When it hits bottom, the physical layer actually transmits the message, which by now might 
look as shown in Figure2.4 


Data Link Layer header | Network Layer header | Transport Layer header | Session Layer header | Presentation Layer header | Application Layer header | Message

Bits that actually appear on the network


Figure2.4: Message with different header in OSI model 
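
The header nesting described above can be sketched in a few lines. This is only an illustration: the bracketed text headers, the single data link trailer, and the function names `send_down` and `receive_up` are assumptions for the example, not the format of any real protocol stack.

```python
# Layers in the order a message passes through them on the sending side.
LAYERS = ["application", "presentation", "session", "transport", "network", "data link"]
TRAILER = b"[data link trailer]"  # assumed: only the data link layer adds a trailer

def send_down(message: bytes) -> bytes:
    """Each layer prepends its own header to whatever it receives from above."""
    frame = message
    for layer in LAYERS:
        frame = f"[{layer}]".encode() + frame
    return frame + TRAILER

def receive_up(frame: bytes) -> bytes:
    """The receiver strips the trailer, then one header per layer, bottom layer first."""
    if not frame.endswith(TRAILER):
        raise ValueError("missing trailer")
    frame = frame[: -len(TRAILER)]
    for layer in reversed(LAYERS):
        header = f"[{layer}]".encode()
        if not frame.startswith(header):
            raise ValueError(f"missing {layer} header")
        frame = frame[len(header):]
    return frame

wire = send_down(b"hello")
print(wire)
print(receive_up(wire))
```

The outermost header belongs to the lowest layer, which matches the ordering in Figure2.4.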

Layer Features 

Physical Layer = Transmits raw bits over the communication channel.

= Converts a sequence of binary digits into electric or light signals.

= Deals with mechanical, electrical, procedural and functional
characteristics of raw bit transmission.

= A popular physical layer standard for serial communication is RS232-C.

Data Link Layer = Detects and corrects errors in transmitted data by partitioning the
data into frames.

= Controls the flow of frames so that the sender does not transmit
faster than the receiver can accept.

Network Layer = Sets up a logical path between two sites for communication.

= Encapsulates data into packets.

= Carries out high level addressing and routing.

= Establishes a virtual circuit between the sender and receiver before
actual transmission using X.25, a connection oriented protocol.

= IP is a connectionless protocol and hence packets can take
independent routes.

Transport Layer = Provides site to site communication and network independent
transport service.

= Sender converts messages into packets and receiver reassembles
the packets on reception.

= Includes mechanisms for handling lost and out of sequence packets.

= Provides reliable ordered delivery of data over a logical connection
using Transmission Control Protocol (TCP), a connection oriented
protocol.

= Provides a connectionless unreliable transport, where messages can
be lost, duplicated or may arrive out of order, using UDP.
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Session Layer 7 


= Manages session creation and termination between two
communicating processes.

= Organizes and synchronizes dialog and manages data exchange after
authentication.

= Uses dialogs such as one way, two way and two way simultaneous.

= This layer is not required for connectionless communication.

Presentation Layer = Preserves the meaning of the message but resolves syntax
differences.

= Provides data translation/conversion if the sending and receiving
computers use different data representations.

= Can involve encryption/decryption for confidential data.

= Carries out compression/decompression in case of large volumes of
data.

Application Layer = Provides services which directly support end users and are
application specific.

= Uses protocols for common applications, like X.400 (email
protocol), X.500 (directory service protocol), FTP (File Transfer
Protocol), and rlogin (remote login protocol).


1.4 Internet Protocols 


Table 2.1: OSI/ISO Layers 


Layers             Protocols at each layer

Application        FTP, TFTP, TELNET, SMTP, DNS
Transport          TCP, UDP
Network            IP, ICMP
Data link layer    ARP, RARP
Physical           SLIP, Ethernet, token ring


Figure2.5: IP suite 


e There are two commonly used internet protocols, namely Transmission Control Protocol (TCP)
and Internet Protocol (IP).

e TCP/IP is a set of rules (protocols) governing communications among all computers on the
Internet.

e TCP/IP dictates how information should be packaged (turned into bundles of information
called packets), sent, and received, as well as how it gets to its destination.
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e Apart from the lower layer protocols, the IP suite also includes protocols for common 
applications such as electronic mail, terminal emulation and file transfer. 

e The IP suite holds the internet together and is found in most company networks. 

e The IP suite consists of five layers, as shown in Figure 2.5 above.

e A set of protocols at each layer gives users flexibility for different applications and
different communication requirements.


OSI reference model compared with the IP suite (top to bottom):
   Application level: FTP, Telnet, SMTP, SNMP; NFS; XDR
   TCP, UDP
   Routing protocols, IP, ICMP; ARP, RARP
   Lower layers: not specified


Figure2.6: The interaction between TCP and IP Suite 


Layer Features 

Physical Layer = Ethernet is the common protocol used at this layer.

= Another is the SLIP protocol, which uses the RS232-C standard for
communication.

Data Link Layer = Protocols at this layer are concerned with the unique identification of
each computer on the network.

= Address Resolution Protocol (ARP) translates an IP address to a physical
address, while RARP (Reverse Address Resolution Protocol) translates a
physical address to an IP address.

Network Layer = The Internet Protocol (IP) transmits datagrams between hosts.

= Internet protocols perform fragmentation of datagrams into packets and
their reassembly.

= This layer routes the IP packet using a dynamic algorithm with a
minimal path selection policy.

= The ICMP (Internet Control Message Protocol) provides flow control
by informing the sender that messages are arriving too fast or
too slow, so that the sender can vary the speed of transmission accordingly.

Transport Layer = TCP provides a connection oriented, reliable byte stream service.

= It handles the establishment and termination of connections,
sequencing of data, end to end reliability with error control, and end to
end flow control.
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= UDP is connectionless unreliable data gram service. 

= Both protocols use a 16 bit port number in the header attached to the data.

= When the packet reaches the destination host, the transport layer
protocol dispatches it to the specified port.

= The identified port then picks up the packet.

Application Layer = FTP (File Transfer Protocol) is used to transfer files back and forth
between a remote host and a network by executing FTP commands.

= TFTP (Trivial File Transfer Protocol) is similar to FTP, but it uses UDP
and cannot check user authenticity or provide directory listings.

= TELNET enables terminals and terminal oriented processes to
communicate with each other, using a telnet command on the local
host to carry out a remote login session.

= SMTP (Simple Mail Transfer Protocol) enables processes on a network
to exchange electronic mail using a TCP connection.

= DNS (Domain Name System) is a set of symbolic domain names for
hosts and networks.

= It is a hierarchical structure independent of the physical structure
of the network.


Table 2.2: ISO Layers 


1.5 ATM Reference Model 


Other layers not specified in ATM 
protocol reference model 
ATM Adaption Layer (AAL) 


ATM Layer 
Physical Layer 


Figure2.7: ATM Reference model 
e The ATM reference model is divided into three layers (i) Physical Layers (ii) ATM layer and 
(iii) ATM adaption layer. 
e Applications involving data, voice and video are built on top of these layers. 
(a) Physical Layer 
o The physical layer is the bottom most layer of the ATM protocol suite. 
o It has two sublayers: the physical medium dependent (PMD) sublayer and the transmission
convergence (TC) sublayer.
o The PMD sublayer defines the actual speed for transmission on the physical communication
medium used.
o The TC sublayer defines a protocol for the encoding and decoding of cell data into
suitable electrical/optical waveforms for transmission and reception on the physical
communication medium defined by the PMD sublayer.
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o The physical layer can transfer cells from one user to another in one of the following two 
ways: 
1. By carrying cells as a synchronous data stream:
= In this case, the user network interface (UNI), which takes the form of an ATM adapter
board plugged into a computer, puts out a stream of cells onto a wire or fiber.
= The transmission stream must be continuous, and when there is no data to be sent,
empty cells are transmitted.
2. By carrying cells in the payload portion of an externally framed transmission structure:
= The UNI uses some standard transmission structure for framing and synchronization at
the physical layer.
= SONET is the most commonly used standard for this purpose.
(b) ATM Layer 


GFC      VPI      VCI       PTI      CLP     HEC      Payload
4 bits   8 bits   16 bits   3 bits   1 bit   8 bits   48 bytes

|<------------------ Header area ------------------>|<- Payload area ->|


Figure2.8: ATM cell format 
e The ATM layer handles most of the cell processing and routing activities. 
e The ATM layer is independent of the physical medium used to transport the cell. 
e The functionality of the ATM layer is defined by the fields present in the ATM cell header. 
e Above fields and their functions are as follows: 
e Generic flow control (GFC) field.

o This field occupies 4 bits in an ATM cell header.

o The default setting of four 0’s indicates that the cell is uncontrolled.

o An uncontrolled cell does not take precedence over other cells when contending for the
virtual circuit.

o The bits in the GFC field could be used to prioritize voice over video or to indicate that both
voice and video take precedence over other types of data.

e Virtual path identifier and virtual channel identifier. 

o The VPI field occupies 8 bits and the VCI field occupies 16 bits in an ATM cell header. 

o These two fields are used by the routing protocol to determine the paths and channels 
the cell will traverse. 

o When a cell arrives at an ATM switch, its VPI and VCI values are used to determine the
new virtual identifiers to be placed in the cell header and the outgoing link over which to
transmit the cell.

e Payload Type identifier (PTI) field. 

o This field occupies 3 bits in an ATM cell header.

o It is used to distinguish data cells from control cells so that user data and control data
can be transmitted on different subchannels.

e Cell loss priority field. 
o This field occupies 1 bit in an ATM cell header.
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o When set, it indicates that the cell can be discarded, if necessary during periods of 
network congestion. 
o For example, voice data may be able to suffer lost cells without the need for retransmission,
whereas text data cannot.
o In this case, an application may set the CLP field of the cells for voice traffic.
e Header error control (HEC) field.
o This field occupies 8 bits in the ATM cell header.
o It is used to protect the header field from transmission errors.
o It contains a checksum of only the header (not the payload).
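
The field widths listed above add up to a 5-byte header (4 + 8 + 16 + 3 + 1 + 8 = 40 bits). The sketch below packs and unpacks such a header. The function names are hypothetical, and the HEC computation (a CRC-8 with generator x^8 + x^2 + x + 1 over the first four bytes, XORed with 0x55) reflects my understanding of ITU-T I.432 rather than anything stated in these notes, so treat it as an assumption.

```python
import struct

def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8, most significant bit first, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

def pack_header(gfc: int, vpi: int, vci: int, pti: int, clp: int) -> bytes:
    """Build the 5-byte header: GFC(4) VPI(8) VCI(16) PTI(3) CLP(1) HEC(8)."""
    if not (0 <= gfc < 16 and 0 <= vpi < 256 and 0 <= vci < 65536
            and 0 <= pti < 8 and clp in (0, 1)):
        raise ValueError("field out of range")
    word = (gfc << 28) | (vpi << 20) | (vci << 4) | (pti << 1) | clp
    first4 = struct.pack(">I", word)
    hec = crc8(first4) ^ 0x55  # assumed HEC rule, covers the header only
    return first4 + bytes([hec])

def unpack_header(header: bytes) -> dict:
    """Check the HEC and split the header back into its fields."""
    if len(header) != 5:
        raise ValueError("an ATM cell header is 5 bytes long")
    word, hec = struct.unpack(">IB", header)
    if crc8(header[:4]) ^ 0x55 != hec:
        raise ValueError("HEC mismatch: header corrupted")
    return {"gfc": word >> 28, "vpi": (word >> 20) & 0xFF,
            "vci": (word >> 4) & 0xFFFF, "pti": (word >> 1) & 0x7, "clp": word & 1}

header = pack_header(gfc=0, vpi=1, vci=5, pti=0, clp=1)
print(header.hex(), unpack_header(header))
```

Because the checksum covers only the header, a switch can validate and rewrite the VPI/VCI fields without touching the payload.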
(c) ATM Adaption Layer. 
e The AAL is responsible for providing different types of service to different types of traffic 
according to their specific requirements. 
e It packages various kinds of user traffic into 48-byte cell payloads, together with the overhead
needed to meet specific quality of service requirements of different types of traffic.
e To reflect the spectrum of applications, four service classes were defined by the ITU.


Service class              Class A   Class B   Class C          Class D
Bit rate type              CBR       VBR       VBR              VBR
Delay sensitive            Yes       Yes       No               No
Connection oriented        Yes       Yes       Yes              No
AAL protocol to be used    AAL1      AAL2      AAL3/4 or AAL5   AAL3/4 or AAL5


Table2.3: Service classes for the ATM Adaption Layer (AAL) 


2 Message Passing 
2.1 Interprocess Communication 
e Interprocess communication (IPC) basically requires information sharing among two or more 
processes. 


Figure2.9: Interprocess communication (a) Shared data approach (b) Message passing approach 
e Two basic methods for information sharing are as follows: 
o Original sharing, or shared-data approach. 
o Copy sharing , or message-passing approach. 
e In the shared-data approach, the information to be shared is placed in a common memory
area that is accessible to all processes involved in an IPC.
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e Inthe message-passing approach, the information to be shared is physically copied from the 
sender process’s space to the address space of all the receiver processes. 

e A message-passing system is a subsystem of a distributed operating system that provides a set
of message-based IPC protocols, and does so by shielding the details of complex network
protocols and multiple heterogeneous platforms from programmers.

e It enables processes to communicate by exchanging messages and allows programs to be 
written by using simple communication primitives, such as send and receive. 


2.2 Desirable Features of a Good Message Passing system 
(a) Simplicity 
o A message passing system should be simple and easy to use.
o It must be straight forward to construct new applications and to communicate with 
existing one by using the primitives provided by message passing system. 
(b) Uniform Semantics 
o In a distributed system, a message-passing system may be used for the following two
types of interprocess communication:
= Local communication, in which the communicating processes are on the same node. 
= Remote communication, in which the communicating processes are on different 
nodes. 
o Semantics of remote communication should be as close as possible to those of local 
communications. 
(c) Efficiency 
o An IPC protocol of a message-passing system can be made efficient by reducing the 
number of message exchanges, as far as practicable, during the communication process. 
o Some optimizations normally adopted for efficiency include the following: 
= Avoiding the costs of establishing and terminating connections between the same
pair of processes for each and every message exchange between them.
= Minimizing the costs of maintaining the connections.
= Piggybacking the acknowledgement of previous messages on the next message
during a connection between a sender and a receiver that involves several message
exchanges.
(d) Reliability 

o A good IPC protocol can cope with failure problems and guarantees the delivery of a 
message. 

o Handling of lost messages involves acknowledgement and retransmission on the basis of 
timeouts. 

o A reliable IPC protocol should also be capable of detecting and handling duplicate 
messages. 


(e) Correctness 
o Correctness is a feature related to IPC protocols for group communication. 
o Issues related to correctness are as follows: 
o Atomicity 
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= Atomicity ensures that every message sent to a group of receivers will be delivered 
to either all of them or none of them. 
o Ordered delivery
= Ordered delivery ensures that messages arrive at all receivers in an order acceptable
to the application.
o Survivability
= Survivability guarantees that messages will be correctly delivered despite partial
failures of processes, machines, or communication links.


(f) Flexibility 
o Not all applications require the same degree of reliability and correctness of the IPC 
protocol. 


o The IPC protocols of a message passing system must be flexible enough to cater to the 
various needs of different applications. 
o The IPC primitives should be such that users have the flexibility to choose and specify the
types and levels of reliability and correctness required by their applications.
o IPC primitives must also have the flexibility to permit any kind of control flow between 
the cooperating processes, including synchronous and asynchronous send/receive. 
(g) Security 
e A good message passing system must also be capable of providing a secure end to end 
communication. 
e Steps necessary for secure communication include the following: 
o Authentication of the receiver of a message by the sender. 
o Authentication of the sender of a message by its receivers. 
o Encryption of a message before sending it over the network. 
(h) Portability
o The message passing system should itself be portable. 
o It should be possible to easily construct a new IPC facility on another system by reusing 
the basic design of the existing message passing system. 
o The application written by using primitives of IPC protocols of the message passing 
system should be portable. 
o This may require use of an external data representation format for the communication 
taking place between two or more processes running on computers of different 
architectures. 


2.3. Issues in IPC by Message Passing 
e A message is a block of information formatted by a sending process in such a manner that it
is meaningful to the receiving process.
e It consists of a fixed-length header and a variable-size collection of typed data objects. 
e The header usually consists of the following elements: 
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Structural information 


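Message layout (left to right):

Actual data or pointer to the data | Number of bytes/elements | Type | Sequence number or message ID | Receiving process address | Sending process address

The first field is the variable-size collection of typed data; the remaining fields form the
fixed-length header (structural information, sequence number or message ID, and addresses).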


Figure2.10: A typical message structure 


e Address. 
o It contains characters that uniquely identify the sending and receiving processes in the 
network. 


e Sequence number. 

o This is the message identifier (ID), which is very useful for identifying lost messages and
duplicate messages in case of system failures.

e Structural information. 

o This element also has two parts. 

o The type part specifies whether the data to be passed on to the receiver is included 
within the message or the message only contains a pointer to the data, which is stored 
somewhere outside the contiguous portion of the message. 

o The second part of this element specifies the length of the variable-size message data.
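
A message with a fixed-length header and a variable-size body could be encoded roughly as follows. The field sizes (8-byte addresses, 32-bit sequence number and length, 1-byte type flag), the `Message` class and the example addresses are all assumptions chosen for illustration; the notes do not prescribe a concrete layout.

```python
import struct
from dataclasses import dataclass

# Assumed fixed-length header: sender, receiver, sequence number, type flag, data length.
HEADER = struct.Struct(">8s8sIBI")

@dataclass
class Message:
    sender: str                # address of the sending process
    receiver: str              # address of the receiving process
    seq: int                   # sequence number (message ID)
    data: bytes                # inline data, or an encoded pointer to the data
    is_pointer: bool = False   # structural information: type part

    def encode(self) -> bytes:
        """Serialize the fixed-length header followed by the variable-size data."""
        header = HEADER.pack(self.sender.encode(), self.receiver.encode(),
                             self.seq, int(self.is_pointer), len(self.data))
        return header + self.data

    @classmethod
    def decode(cls, raw: bytes) -> "Message":
        """Read the header first, then use its length field to slice out the data."""
        sender, receiver, seq, kind, length = HEADER.unpack_from(raw)
        data = raw[HEADER.size:HEADER.size + length]
        return cls(sender.rstrip(b"\0").decode(), receiver.rstrip(b"\0").decode(),
                   seq, data, bool(kind))

msg = Message("node1:p7", "node2:p3", seq=42, data=b"hello")
raw = msg.encode()
print(len(raw), Message.decode(raw))
```

The receiver can parse the header without knowing the data size in advance, which is the point of keeping the header fixed-length.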

e In the design of the IPC protocol for a message passing system, the following important issues
need to be considered:

o Who is the sender?

o Who is the receiver?

o Is there one receiver or many receivers?

o Is the message guaranteed to have been accepted by receivers?

o Does the sender need to wait for the reply?

o What should be done if a catastrophic event such as a node crash or a communication
link failure occurs during the course of communication?

o What should be done if the receiver is not ready to accept the message: will the message
be discarded or stored in a buffer? In the case of buffering what would be done if the
buffer is full?

o If there are outstanding messages for a receiver, can it choose the order in which to
service the outstanding messages?



3. Synchronization 
e A central issue in the communication structure is the synchronization imposed on the 
communicating processes by the communication primitives. 
e The semantics used for synchronization may be broadly classified as blocking and
nonblocking types.


Nirali C Dutiya, CE Department | 2160710 — Distributed Operating System 26 


C | Darshan Unit 2 -Communication in Distributed Systems 


Institute of Engineering & Technology 


e Blocking Semantics: 

o A primitive is said to have blocking semantics if its invocation blocks the execution of its
invoker.

o In the case of a blocking send primitive, after execution of the send statement, the sending
process is blocked until it receives an acknowledgement from the receiver that the
message has been received.

o In blocking receive primitive, after execution of the receive statement, the receiving 
process is blocked until it receives a message. 

e Non Blocking Semantics 

o For nonblocking send primitive, after execution of the send statement, the sending 
process is allowed to proceed with its execution as soon as the message has been copied 
to a buffer. 

o For a nonblocking receive primitive, the receiving process proceeds with its execution 
after execution of the receive statement, which returns control almost immediately just 
after telling the kernel where the message buffer is. 

o An important issue in a nonblocking receive primitive is how the receiving process knows
that the message has arrived in the message buffer.

o One of the following two methods is commonly used for this purpose:

o Polling.
= In this method, a test primitive is provided to allow the receiver to check the buffer
status.
= The receiver uses this primitive to periodically poll the kernel to check if the message
is already available in the buffer.
o Interrupt. 
= In this method, when the message has been filled in the buffer and is ready for use 
by the receiver, a software interrupt is used to notify the receiving process. 
e When both the send and receive primitives of a communication between two processes use
blocking semantics, the communication is said to be synchronous; otherwise it is
asynchronous.
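
The difference between a blocking receive and a nonblocking receive with polling can be tried out with Python's standard `queue` and `threading` modules. This is a single-machine sketch only: a thread-safe queue stands in for the kernel message buffer, and the delays are arbitrary values picked for the demonstration.

```python
import queue
import threading
import time

mailbox = queue.Queue()  # stands in for the kernel's message buffer

def sender() -> None:
    time.sleep(0.2)       # simulate work before the message is sent
    mailbox.put("hello")  # returns as soon as the message is copied into the buffer

# Nonblocking receive with polling: test the buffer, do other work, test again.
threading.Thread(target=sender).start()
polls = 0
while True:
    try:
        polled_msg = mailbox.get_nowait()  # test primitive: raises queue.Empty if no message
        break
    except queue.Empty:
        polls += 1
        time.sleep(0.05)                   # the receiver is free to do other work here
print(f"polling receive got {polled_msg!r} after {polls} empty polls")

# Blocking receive: the caller is suspended until a message arrives.
threading.Thread(target=sender).start()
blocked_msg = mailbox.get()
print(f"blocking receive got {blocked_msg!r}")
```

Polling keeps the receiver responsive at the cost of repeated checks, whereas the blocking call is simpler but leaves the receiver idle while it waits.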


Blocking send primitive: client running -> trap to kernel, process blocked (client blocked
while the message is being sent) -> return from kernel, process released -> client running.

Nonblocking send primitive: client running -> trap -> message copied to kernel buffer ->
client running while the message is being sent.
Figure2.11: Blocking and non blocking primitives 
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3.1 Buffering 
e In the standard message passing model, messages can be copied many times: from the user
buffer to the kernel buffer (the output buffer of a channel), from the kernel buffer of the 
sending computer (process) to the kernel buffer in the receiving computer (the input buffer 
of a channel), and finally from the kernel buffer of the receiving computer (process) to a user 
buffer. 
e Buffering can be of following types. 
(a) Null Buffer (No Buffering) 
o In this case, there is no place to temporarily store the message.
o Hence one of the following implementation strategies may be used: 
= The message remains in the sender process’s address space and the execution of the 
send is delayed until the receiver executes the corresponding receive. 
= The message is simply discarded and the time-out mechanism is used to resend the 
message after a timeout period. 
= The sender may have to try several times before succeeding. 


(a) Sending process -> receiving process (no buffer)

(b) Sending process -> node boundary -> single-message buffer -> receiving process

(c) Sending process -> multiple-message buffer/mailbox/port -> receiving process

Figure2.12 Message transfer with (a) no buffering (b) single message buffer (c) multi buffer 
(b) Single Message Buffer 
o In the single-message buffer strategy, a buffer with the capacity to store a single message is
used on the receiver’s node.



o This strategy is usually used for synchronous communication, in which an application module
may have at most one message outstanding at a time.
(c) Finite Bound Buffer or Multi buffer 

o Systems using asynchronous mode of communication use finite-bound buffers, also 
known as multiple-message buffers. 

o In this case, a message is first copied from the sending process’s memory into the receiving
process’s mailbox and then copied from the mailbox to the receiver’s memory when the
receiver calls for the message.

o When the buffer has finite bounds, a strategy is also needed for handling the problem of 
a possible buffer overflow. 

o The buffer overflow problem can be dealt with in one of the following two ways: 

o Unsuccessful communication.

= In this method, message transfers simply fail whenever there is no more buffer
space, and an error is returned.

o Flow-controlled communication.

= The sender is blocked until the receiver accepts some messages, thus creating space
in the buffer for new messages.
= This method introduces a synchronization between the sender and the
receiver and may result in unexpected deadlocks.
= Due to the synchronization imposed, the asynchronous send does not operate in the
truly asynchronous mode for all send commands.
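
Both overflow strategies can be observed with a bounded `queue.Queue` from the Python standard library. The buffer size of 2, the message names and the 0.1 s delay are arbitrary assumptions; the queue here models the receiver's mailbox on a single machine, not a real network buffer.

```python
import queue
import threading
import time

buffer = queue.Queue(maxsize=2)  # finite-bound buffer holding at most two messages
buffer.put_nowait("m1")
buffer.put_nowait("m2")          # the buffer is now full

# Unsuccessful communication: the send fails and an error is reported.
try:
    buffer.put_nowait("m3")
    outcome = "sent"
except queue.Full:
    outcome = "error: buffer full"
print(outcome)

# Flow-controlled communication: the sender blocks until the receiver frees a slot.
def receiver() -> None:
    time.sleep(0.1)
    buffer.get()                 # accepting a message creates space in the buffer

worker = threading.Thread(target=receiver)
worker.start()
start = time.monotonic()
buffer.put("m3")                 # blocks here until the receiver has taken "m1"
waited = time.monotonic() - start
worker.join()
print(f"sender was blocked for about {waited:.2f} s")
```

The second case shows why flow control makes an asynchronous send behave synchronously whenever the buffer is full.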
(d) Unbounded Capacity Buffer 

o In the asynchronous mode of communication, since a sender does not wait for the 
receiver to be ready, there may be several pending messages that have not yet been 
accepted by the receiver. 

o Therefore, an unbounded-capacity message-buffer that can store all unreceived 
messages is needed to support asynchronous communication with the assurance that all 
the messages sent to the receiver will be delivered. 

o Unbounded capacity of a buffer is practically impossible. 


3.2 Failure Handling in IPC 
e During interprocess communication partial failures such as a node crash or communication 
link failure may lead to the following problems: 
e Loss of request message. 

o This may happen either due to the failure of communication link between the sender 
and receiver or because the receiver’s node is down at the time the request message 
reaches there. 

e Loss of response message. 

o This may happen either due to the failure of communication link between the sender 
and receiver or because the sender’s node is down at the time the response message 
reaches there. 
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Figure2.13: Possible problems in IPC when (a) Request message is lost (b) Response message is lost (c) 
Receiver’s computer crashed. 
e Unsuccessful execution of the request. 
o This may happen due to the receiver’s node crashing while the request is being 
processed. 
e Four-message reliable IPC protocol for client-server communication between two 
processes. 
o The client sends a request message to the server. 
o When the request message is received at the server’s machine, the kernel of that 
machine returns an acknowledgment message to the kernel of the client machine. 
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Figure2.14: Four message reliable IPC protocol for client server 
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o Ifthe acknowledgment is not received within the timeout period, the kernel of the client 
machine retransmits the request message. 

o When the server finishes processing the client’s request it returns a reply message to the 
client. 

o When the reply is received at client machine, the kernel of that machine returns an 
acknowledgment message to the kernel of the server machine. 

o If the acknowledgment message is not received within the timeout period, the kernel of 
the server machine retransmits the reply message. 

e Three message reliable IPC message protocol 

o The client sends a request message to the server. 

o When the server finishes processing the client’s request, it returns a reply message 
(containing the result of processing) to the client. 

o The client remains blocked until the reply is received. 

Client -> Server: Request
Server -> Client: Reply
Client -> Server: Acknowledgment

Legend: blocked state, executing state

Figure2.15: Three message reliable IPC protocol for client server

o If the reply is not received within the timeout period, the kernel of the client machine 
retransmits the request message. 

o When the reply message is received at the client’s machine, the kernel of that machine
returns an acknowledgment message to the kernel of the server machine.

o If the acknowledgment message is not received within the timeout period, the kernel of 
the server machine retransmits the reply message. 

o If the request message is lost, it will be retransmitted only after the timeout period, 
which has been set to a large value to avoid unnecessary retransmission of the request 
message. 

o On the other hand, if the timeout value is not set properly taking into consideration the 
long time needed for request processing, unnecessary retransmissions of the request 
message will take place. 

e Two message reliable IPC message protocol.

o The client sends a request message to the server and remains blocked until a reply is
received from the server.
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Client Server 


Legend: blocked state, executing state


Figure 2.16: Two message reliable IPC protocol for client server 
o When the server finishes processing the client’s request, it returns a reply message
(containing the result of processing) to the client.
o If the reply is not received within the timeout period, the kernel of the client machine
retransmits the request message.
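
The retransmit-on-timeout behaviour of the two message protocol can be simulated without real networking. In this sketch a lost message is modelled as a channel returning `None`, and each loop iteration stands for one timeout period; the class and function names are hypothetical.

```python
class LossyChannel:
    """Drops the first `drops` messages passed through it, then delivers the rest."""
    def __init__(self, drops: int) -> None:
        self.drops = drops

    def deliver(self, message):
        if self.drops > 0:
            self.drops -= 1
            return None          # message lost in transit
        return message

class Server:
    def __init__(self) -> None:
        self.executions = 0      # how many times a request was actually processed

    def handle(self, request: str) -> str:
        self.executions += 1
        return f"result of {request}"

def client_call(request, server, to_server, to_client, max_tries=5):
    """Send the request and retransmit it each time the timeout expires without a reply."""
    for attempt in range(1, max_tries + 1):
        delivered = to_server.deliver(request)
        reply = None
        if delivered is not None:
            reply = to_client.deliver(server.handle(delivered))
        if reply is not None:
            return reply, attempt
        # No reply within the timeout period: fall through and retransmit.
    raise TimeoutError("no reply after retransmissions")

server = Server()
reply, attempts = client_call("read block 7", server, LossyChannel(1), LossyChannel(1))
print(reply, attempts, server.executions)
```

In this run the first request and the first reply are lost, so the client needs three attempts and the server executes the request twice. That repeated execution is harmless only if the request is safe to repeat.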


3.3. Idempotency and Handling of Duplicate Request Messages 


Client                                      Server (balance = 1000)

Send request debit(100)           ---->     Process debit routine
                                            balance = 1000 - 100 = 900
Timeout                           <-- X     Return (success, 900); response is lost
Retransmit request debit(100)     ---->     Process debit routine
                                            balance = 900 - 100 = 800
Receive response (success, 800)   <----     Return (success, 800)
                                            balance = 800
Figure2.17: A non-idempotent routine 
• Idempotency means repeatability.
• That is, an idempotent operation produces the same result, without any side effects, no
matter how many times it is performed with the same arguments.
• On the other hand, operations that do not necessarily produce the same result when
executed repeatedly with the same arguments are said to be non-idempotent.
• Consider the following routine of a server process that debits a specified amount from a bank
account and returns the remaining balance to the requesting client:
    debit(amount)
        if (balance >= amount)
            { balance = balance - amount;
              return ("success", balance); }
        else return ("failure", balance);
    end;
• In Figure 2.17, the first request asks the server to debit an amount of 100.
• The server sends the reply (success, 900) to the client, indicating that the remaining balance is 900.
• This reply, for some reason, could not be delivered to the client.
• The client then retransmits the request debit(100) after the timeout.
• The server processes the debit(100) request once again and sends the reply (success, 800),
indicating that the remaining balance is 800, which is not correct.
• Exactly-once semantics can be used to avoid this type of repeated execution.


Figure 2.18: An example of exactly-once semantics. (The server kernel keeps a reply cache that maps a request identifier to its reply. For request-1, debit(100), no match is found in the cache, so the request is processed, the reply (success, 900) is saved and returned, and the response is lost. When the client retransmits request-1 after a timeout, a match is found and the cached reply (success, 900) is returned; the balance remains 900.)
o As shown in Figure 2.18, the client makes request-1; the server machine’s kernel receives
request-1 and then checks the reply cache to see if there is a cached reply for request-1.
o There is no match, hence the request is forwarded to the appropriate server.
o The server processes the request and returns the result to the kernel.
o The kernel copies the request identifier and the result of execution to the reply cache and sends
the result in the form of a response.
o This response is lost; the client times out on request-1 and retransmits request-1.
o The server machine’s kernel receives request-1 once again and checks the reply cache to see
if there is a cached reply for request-1.
o This time a match is found and the cached result is sent as the response.
o Thus reprocessing of the duplicate request is avoided.
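The difference between Figure 2.17 and Figure 2.18 can be reproduced with a small sketch. The class name `BankServer` and the method `handle` are hypothetical names chosen for this example; the reply cache is modelled as a plain dictionary keyed by request identifier, which is an assumption about its layout.

```python
class BankServer:
    def __init__(self, balance):
        self.balance = balance
        self.reply_cache = {}            # request identifier -> saved reply

    def debit(self, amount):
        """The non-idempotent debit routine from the text."""
        if self.balance >= amount:
            self.balance -= amount
            return ("success", self.balance)
        return ("failure", self.balance)

    def handle(self, request_id, amount):
        """Kernel-side wrapper: execute each request identifier at most once."""
        if request_id in self.reply_cache:
            return self.reply_cache[request_id]   # match found: reuse the reply
        reply = self.debit(amount)                # no match: process the request
        self.reply_cache[request_id] = reply      # save the reply before sending
        return reply


# Figure 2.17: without a reply cache the retransmitted request is executed again.
plain = BankServer(1000)
plain.debit(100)
duplicate_reply = plain.debit(100)

# Figure 2.18: with the reply cache the retransmission gets the saved reply.
cached = BankServer(1000)
cached.handle("request-1", 100)
retransmitted_reply = cached.handle("request-1", 100)
```

A real reply cache would also need a policy for discarding old entries; that is left out of this sketch.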


3.4 Group Communication 
• Depending on single or multiple senders and receivers, the following three types of group
communication are possible:
o One to many (single sender and multiple receivers).
o Many to one (multiple senders and single receiver).
o Many to many (multiple senders and multiple receivers).


3.4.1 One-to-Many Communication

• In this scheme, there are multiple receivers for a message sent by a single sender.

• The one-to-many scheme is also known as multicast communication.

• A special case of multicast communication is broadcast communication, in which the
message is sent to all processors connected to a network.

a) Group Management 


o A closed group is one in which only the members of the group can send a message to the
group.
o An outside process cannot send a message to the group as a whole, although it may send
a message to an individual member of the group.
o On the other hand, an open group is one in which any process in the system can send a
message to the group as a whole.


b) Group Addressing 


o A two-level naming scheme is normally used for group addressing.
o The high-level group name is an ASCII string that is independent of the location of the
processes in the group.
o On the other hand, the low-level group name depends to a large extent on the
underlying hardware.
o On some networks it is possible to create a special network address to which multiple
machines can listen.
o Such a network address is called a multicast address.
o Therefore, in such systems a multicast address is used as the low-level name for a group.
o Some networks that do not have the facility to create multicast addresses may have a
broadcast facility.
o A packet sent to a broadcast address is automatically delivered to all machines in the
network.
o The software of each machine must check to see if the packet is intended for it.
o If a network supports neither multicast addresses nor broadcasting, a one-to-one
communication mechanism has to be used to implement the group communication facility.


c) Message Delivery to Receiver Processes 


o User applications use high-level group names in programs.
o A centralized group server maintains a mapping of high-level group names to their
low-level names.
o The group server also maintains a list of the process identifiers of all processes in each
group.
o When a sender sends a message to a group specifying its high-level name, the kernel of the
sending machine contacts the group server to obtain the low-level name of the group and
the list of process identifiers belonging to the group.
o The list of process identifiers is inserted in the message packet.
o If the low-level group name is a list of machine identifiers, the kernel sends a copy of the
packet separately to each machine in the list.
o When the packet reaches a machine, the kernel of that machine extracts the list of
process identifiers from the packet and forwards the message in the packet to those
processes in the list that belong to its own machine.
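The delivery steps above can be sketched for the case in which the low-level group name is a list of machine identifiers. Everything named here is hypothetical: `GROUP_TABLE` stands in for the group server's mapping, the group "printers" and its members are made-up data, and process identifiers are modelled as (machine, pid) pairs so that each kernel can tell which processes are its own.

```python
# Stand-in for the centralized group server's table:
# high-level name -> (low-level name as machine ids, member processes)
GROUP_TABLE = {
    "printers": (["m1", "m2"], [("m1", 11), ("m1", 12), ("m2", 21)]),
}

def receive_on_machine(machine, packet):
    """Receiving kernel: forward the message only to listed local processes."""
    return [(machine, pid, packet["message"])
            for owner, pid in packet["members"] if owner == machine]

def send_to_group(group_name, message):
    """Sending kernel: look up the group, then send one copy per machine."""
    machines, members = GROUP_TABLE[group_name]          # query the group server
    packet = {"members": members, "message": message}    # member list goes in the packet
    deliveries = []
    for machine in machines:                             # one-to-one copy per machine
        deliveries.extend(receive_on_machine(machine, packet))
    return deliveries

deliveries = send_to_group("printers", "hi")
```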


d) Buffered and Unbuffered Multicast 


o For an unbuffered multicast, the message is not buffered for the receiving process and is
lost if the receiving process is not in a state ready to receive it.
o Therefore, the message is received only by those processes of the multicast group that
are ready to receive it.
o On the other hand, for a buffered multicast, the message is buffered for the receiving
process, so each process of the multicast group will eventually receive the message.


e) Send-to-All and Bulletin-Board Semantics 


o Ahamad and Bernstein [1985] described the following two types of semantics for
one-to-many communication:
▪ Send-to-all semantics: A copy of the message is sent to each process of the multicast
group, and the message is buffered until it is accepted by the process.
▪ Bulletin-board semantics: A message to be multicast is addressed to a channel
instead of being sent to every individual process of the multicast group.
▪ From a logical point of view, the channel plays the role of a bulletin board.
▪ A receiving process copies the message from the channel, instead of removing it, when
it makes a receive request on the channel.


f) Flexible Reliability in Multicast Communication 


o Different applications require different degrees of reliability.
o The sender of a multicast message can specify the number of receivers from which a
response message is expected.
o In one-to-many communication, the degree of reliability is normally expressed in the
following forms:
▪ 0-reliable: No response is expected by the sender from any of the receivers.
▪ 1-reliable: The sender expects a response from any one of the receivers.
▪ m-out-of-n-reliable: The multicast group consists of n receivers and the sender
expects a response from m (1 < m < n) of the receivers.
▪ All-reliable: The sender expects a response message from all the receivers of the
multicast group.


g) Atomic Multicast 

o Atomic multicast has an all-or-nothing property.
o When a message is sent to a group by atomic multicast, it is either received by all the
surviving processes that are members of the group or else it is not received by any of
them.
o One method to implement atomic (reliable) multicast is the following:
▪ Each message has a message identifier field to distinguish it from all other messages
and a field to indicate that it is an atomic multicast message.
▪ The sender sends the message to a multicast group.
▪ The kernel of the sending machine sends the message to all members of the group
and uses timeout-based retransmissions.
▪ A process that receives the message checks its message identifier field to see if it is a
new message. If not, it is simply discarded.
▪ Otherwise, the receiver checks to see if it is an atomic multicast message.
▪ If so, the receiver also performs an atomic multicast of the same message, sending it
to the same multicast group.
▪ The kernel of this machine treats this message as an ordinary atomic multicast
message and uses timeout-based retransmission when needed.
▪ In this way, each receiver of an atomic multicast message will perform an atomic
multicast of the message to the same multicast group.
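The re-multicast idea can be illustrated with a small in-memory simulation. This is a sketch under simplifying assumptions: timeout-based retransmission is left out, the hypothetical `lost_links` set models copies that never arrive (for example because the sender crashed part-way through sending), and delivery is a direct method call instead of a network packet.

```python
class Member:
    def __init__(self, name):
        self.name = name
        self.seen = set()        # message identifiers already received
        self.delivered = []      # payloads handed to the application

    def receive(self, msg_id, payload, group, lost_links):
        if msg_id in self.seen:
            return               # not a new message: simply discard it
        self.seen.add(msg_id)
        self.delivered.append(payload)
        # The receiver multicasts the same message to the same group.
        multicast(self, msg_id, payload, group, lost_links)

def multicast(sender, msg_id, payload, group, lost_links):
    for member in group:
        if member is sender:
            continue
        if (sender.name, member.name) in lost_links:
            continue             # this copy is lost on the way
        member.receive(msg_id, payload, group, lost_links)

a, b, c, d = (Member(n) for n in "ABCD")
group = [a, b, c, d]
sender = Member("S")
# The sender's copies to C and D are lost, yet C and D still get the message
# because A and B re-multicast it.
multicast(sender, 1, "m1", group, lost_links={("S", "C"), ("S", "D")})
```

The price of this scheme is message traffic: every receiver sends the message to the whole group once.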


3.4.2 Many-to-One Communication 

• In this scheme, multiple senders send messages to a single receiver.

• The single receiver may be selective or nonselective.

• A selective receiver specifies a unique sender; a message exchange takes place only if that
sender sends a message.

• On the other hand, a nonselective receiver specifies a set of senders, and if any one sender
in the set sends a message to this receiver, a message exchange takes place.


3.4.3 Many-to-Many Communication 

• In this scheme, multiple senders send messages to multiple receivers.

• An important issue related to the many-to-many communication scheme is that of ordered
message delivery.

• Ordered message delivery ensures that all messages are delivered to all receivers in an order
acceptable to the application.

Figure 2.19: No ordering constraint for message delivery


a) Absolute Ordering 


o The absolute ordering semantics ensures that all messages are delivered to all receiver
processes in the exact order in which they were sent.
o This can be implemented by using global timestamps as message identifiers.
o The system is assumed to have a clock at each machine, and all clocks are synchronized
with each other.
o The kernel of the receiver machine saves all incoming messages for a receiver process in a queue.
o Periodically, messages from the receiver’s queue are delivered to the receiver process.
o Other messages remain in the queue for later delivery.

Figure 2.20: Absolute ordering of messages (t1 < t2)

o S1 sends a message to R1 and R2 at time t1, followed by S2 sending a message to R1
and R2 at time t2.
o The timestamps t1 and t2 are attached to the messages, with t1 < t2.
o Based on absolute ordering semantics, R1 and R2 both receive messages m1 and m2 and
order them as m1 followed by m2.
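A receiver-side queue ordered by timestamp can be sketched as below. The class name `OrderedReceiver` and the `window_end` parameter are assumptions made for the example: the text only says that messages are delivered periodically, so the cut-off time that decides which messages are ready is a modelling choice. Clocks are taken to be perfectly synchronized, as the scheme requires.

```python
import heapq

class OrderedReceiver:
    def __init__(self):
        self._queue = []         # min-heap ordered by (timestamp, sender)
        self.delivered = []      # messages handed to the receiver process

    def on_arrival(self, timestamp, sender, payload):
        """Kernel saves every incoming message in the queue."""
        heapq.heappush(self._queue, (timestamp, sender, payload))

    def deliver_up_to(self, window_end):
        """Periodic delivery: release queued messages stamped <= window_end."""
        while self._queue and self._queue[0][0] <= window_end:
            _, _, payload = heapq.heappop(self._queue)
            self.delivered.append(payload)

r1 = OrderedReceiver()
r1.on_arrival(2, "S2", "m2")     # m2 happens to arrive first
r1.on_arrival(1, "S1", "m1")
r1.on_arrival(9, "S1", "m3")     # stamped after the window; stays queued
r1.deliver_up_to(2)
```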


b) Consistent Ordering (Total Ordering) 


o Absolute-ordering semantics requires globally synchronized clocks, which are not easy
to implement.
o Not all applications need this type of ordering.

Figure 2.21: Consistent ordering of messages (t1 < t2)

o This semantics ensures that all messages are delivered to all receiver processes in the
same order, but that order may differ from the order in which they were sent, as shown in
Figure 2.21.

o S1 sends a message to R1 and R2 at time t1, followed by S2 sending a message to R1 and R2
at time t2.

o Here t1 < t2.

o Based on consistent ordering semantics, R1 and R2 both receive either m1 followed
by m2 or m2 followed by m1, even though m1 was sent prior to m2.

c) Causal Ordering 

o For some applications, consistent-ordering semantics is not necessary and an even weaker
semantics is acceptable.

o Causal ordering semantics ensures that two messages are delivered to all receivers in the
correct order if they are causally related to each other, as shown in Figure 2.22.

Figure 2.22: Causal ordering of messages (senders S1 and S2; receivers R1, R2 and R3)

o S1 sends a message m1 to R1, R2 and R3 at time t1, followed by S2 sending a message m2 to
R1, R2 and R3 at time t2, where t1 < t2.
o After receiving message m1, R1 sends a message m3 to R2 and R3.
o Based on causal ordering semantics, m1 and m3 are causally related to each other.
o Hence m1 must be delivered before m3 at every receiver that gets both of them (R2 and R3),
but no ordering is required between m2 and the other messages, since m2 is not causally
related to them.
4. Client Server Model
4.1 Basic Concepts
• The operating system is structured as a group of cooperating processes called servers,
which offer services to users called clients.
• Both run on the same microkernel as user processes.
• A machine can run a single client or multiple clients, a single server or multiple servers, or a mix of both.
• This model, depicted in Figure 2.23, uses a connectionless request-reply protocol, thus
avoiding the overhead of connection-oriented protocols such as TCP/IP or OSI.


Figure 2.23: Client-server model. (a) Request/reply exchange between clients and servers over the network. (b) Protocol layers used.
• The request-reply protocol works as follows:
o The client requesting the service sends a request message to the server.
o The server completes the task and returns the result, indicating that it has performed the
task, or returns an error message indicating that it has not performed the task.

Figure 2.24: Client-server interaction (the client sends a request and waits for the result; the server provides the service and replies)

o Figure 2.24 shows the client-server interaction.
o Here the reply serves as the acknowledgement of the request.

• As seen in Figure 2.23(b), only three layers of the protocol stack are used.

• The physical and data link layers are responsible for transferring packets between the
client and the server through the hardware.

• There is no need for routing or connection establishment.

• The request-reply protocol defines the set of requests and the replies to the corresponding
requests. Session management and other higher layers are not required.
4.2 Addressing in Client Server
• Before the client can send a message to the server, it must know the server’s address.

Figure 2.25: (a) Machine addressing (1: request, 2: reply) (b) Process addressing (1: broadcast, 2: server gives its own location, 3: request, 4: reply) (c) Name server technique (1: lookup in name server, 2: reply from name server, 3: request, 4: reply)

• There are three main addressing techniques: machine addressing, process addressing, and
the name server technique, as shown in Figure 2.25.
Machine Addressing 


o In this addressing method, the client sends the address as a part of the message, which
is extracted at the receiving end by the server.
o This method works well if there is only one process running on the server machine.
o If there are multiple processes running on the server machine, the process ID should be
sent as a part of the message.


Process Addressing 


o In this addressing method, a message is sent to a process and not to a machine.
o A two-part name comprising the machine ID and the process ID is used for addressing.
o The client kernel uses the machine ID to locate the correct machine, while the server
kernel uses the process ID to locate the process on that machine.
o All machines can number their processes starting from zero.
o The system does not need global coordination, since the address comprises the machine
address and the process address.
o This method is not transparent, because the user knows the location of the server.


Name server technique 


o Broadcasting puts an extra load on the system.
o An extra machine can be used to map high-level ASCII names to machine addresses.
o The ASCII strings are embedded in the programs. This special server is called a name
server.
o In summary, the process addressing techniques are as follows:
▪ Hardwire the machine number into the client code.
▪ Let processes pick random addresses and locate the machine address by the broadcast
method.
▪ Use a two-level naming scheme with the mapping carried out by the name server.
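The name server technique amounts to a table lookup before the first request. The sketch below is illustrative only: `NameServer`, `Client`, the service name "file_server" and the address (243, 4) are invented for the example, and the client-side cache is an added assumption (the notes do not mention caching) that keeps repeated requests from querying the name server again.

```python
class NameServer:
    def __init__(self):
        self._table = {}         # ASCII service name -> (machine_id, process_id)

    def register(self, name, machine_id, process_id):
        self._table[name] = (machine_id, process_id)

    def lookup(self, name):
        return self._table.get(name)     # None if the name is unknown

class Client:
    def __init__(self, name_server):
        self._ns = name_server
        self._cache = {}         # assumption: remember addresses already resolved
        self.lookups = 0

    def resolve(self, name):
        if name not in self._cache:
            self.lookups += 1                    # steps 1 and 2: query and reply
            address = self._ns.lookup(name)
            if address is None:
                raise LookupError(name)
            self._cache[name] = address
        return self._cache[name]                 # used for steps 3 and 4

ns = NameServer()
ns.register("file_server", 243, 4)
client = Client(ns)
first = client.resolve("file_server")
second = client.resolve("file_server")
```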


4.3. Blocking Versus Nonblocking Primitives 

• With a blocking primitive, when a process calls send it specifies a destination and a buffer to
send to that destination.

• While the message is being sent, the sending process is blocked (i.e., suspended).

• The instruction following the call to send is not executed until the message has been
completely sent.

• Similarly, a call to receive does not return control until a message has actually been received
and put in the message buffer pointed to by the parameter.

• The process remains suspended in receive until a message arrives, even if it takes hours.

• An alternative to blocking primitives is nonblocking primitives.

• If send is nonblocking, it returns control to the caller immediately, before the message is
sent.

• The advantage of this scheme is that the sending process can continue computing in parallel
with the message transmission, instead of having the CPU go idle.

• The disadvantage is that the sender cannot modify the message buffer until the message has
been sent.
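The contrast between the two kinds of send can be sketched with a thread standing in for the kernel's transmission work. The function names and the `sent_log` list are hypothetical; `transmit` merely copies the buffer instead of using a real network. The `Event` returned by `nonblocking_send` is one possible way (an assumption, not something the notes prescribe) to tell the caller when the buffer may be reused.

```python
import threading

def transmit(buffer, sent_log):
    """Stand-in for putting the bytes on the wire."""
    sent_log.append(bytes(buffer))

def blocking_send(buffer, sent_log):
    transmit(buffer, sent_log)           # returns only when transmission is done

def nonblocking_send(buffer, sent_log):
    done = threading.Event()

    def worker():
        transmit(buffer, sent_log)
        done.set()                       # signal that the buffer is free again

    threading.Thread(target=worker).start()
    return done                          # control returns to the caller at once

log = []
buf = bytearray(b"hello")
done = nonblocking_send(buf, log)
# ... the sender may compute here, but must not touch buf yet ...
done.wait()                              # only now is it safe to modify the buffer
buf[0:5] = b"world"
blocking_send(buf, log)
```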


4.4 Buffered versus Unbuffered Primitives

Figure 2.26: (a) Unbuffered primitive (address refers to a process) (b) Buffered primitive (address refers to a mailbox)

• A call receive(addr, &m) tells the kernel of the machine on which it is running that the calling
process is listening to address addr and is prepared to receive one message sent to that
address.

• A single message buffer, pointed to by m, is provided to hold the incoming message.

• When the message comes in, the receiving kernel copies it to the buffer and unblocks the
receiving process.

• The use of an address to refer to a specific process is illustrated in Figure 2.26(a).

• This scheme works fine as long as the server calls receive before the client calls send.

• The call to receive is the mechanism that tells the server’s kernel which address the server is
using and where to put the incoming message.

• The problem arises when the send is done before the receive.

• How does the server’s kernel know which of its processes (if any) is using the address in the
newly arrived message, and how does it know where to copy the message?

• The answer is simple: it does not.

• One implementation strategy is to just discard the message, let the client time out, and hope
the server has called receive before the client retransmits.

• The other approach is to create a buffer for storing messages.

• Buffers have to be allocated, freed, and generally managed.

• A conceptually simple way of dealing with this buffer management is to define a new data
structure called a mailbox.

• A process that is interested in receiving messages tells the kernel to create a mailbox for it,
and specifies an address to look for in network packets.

• Hence, all incoming messages with that address are put in the mailbox.

• The call to receive now just removes one message from the mailbox, or blocks (assuming
blocking primitives) if none is present.

• In this way, the kernel knows what to do with incoming messages and has a place to put
them. This technique is frequently referred to as a buffered primitive, and is illustrated in
Figure 2.26(b).

• However, mailboxes are finite and can fill up.
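A mailbox can be modelled as a bounded queue per address. In this sketch the class name `Kernel`, its methods and the capacity of two messages are all hypothetical; discarding a message when the mailbox is full or when no mailbox exists is one of the options the text mentions, chosen here for simplicity.

```python
import queue

class Kernel:
    def __init__(self, mailbox_capacity=2):
        self._capacity = mailbox_capacity
        self._mailboxes = {}     # address -> bounded queue of pending messages

    def create_mailbox(self, addr):
        self._mailboxes[addr] = queue.Queue(maxsize=self._capacity)

    def on_packet(self, addr, message):
        """Called when a packet arrives; returns False if it was discarded."""
        box = self._mailboxes.get(addr)
        if box is None:
            return False                 # no process is using this address
        try:
            box.put_nowait(message)
            return True
        except queue.Full:
            return False                 # mailbox is full: the message is dropped

    def receive(self, addr, timeout=None):
        """Remove one message from the mailbox, blocking if none is present."""
        return self._mailboxes[addr].get(timeout=timeout)

kernel = Kernel(mailbox_capacity=2)
kernel.create_mailbox(5)
accepted = [kernel.on_packet(5, m) for m in ("a", "b", "c")]
unknown = kernel.on_packet(9, "x")
first_message = kernel.receive(5)
```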


4.5 Reliable versus Unreliable Primitives 


Figure 2.27: Reliable primitives
(a) 1. Request (client to server)
    2. ACK (kernel to kernel)
    3. Reply (server to client)
    4. ACK (kernel to kernel)
(b) 1. Request (client to server)
    2. Reply (server to client)
    3. ACK (kernel to kernel)
• With unreliable primitives, the semantics of send are simply defined to be unreliable.

• The system gives no guarantee about messages being delivered. Implementing reliable
communication is entirely up to the users.

• When you drop a letter in a letterbox, the post office does its best to deliver it, but it
promises nothing.

• A reliable primitive requires the kernel on the receiving machine to send an
acknowledgement back to the kernel on the sending machine.

• Only when this acknowledgement is received will the sending kernel free the user (client)
process.

• The acknowledgement goes from kernel to kernel; neither the client nor the server ever sees
an acknowledgement.

• Just as the request from client to server is acknowledged by the server’s kernel, the reply
from the server back to the client is acknowledged by the client’s kernel.

• Thus a request and reply now take four messages, as shown in Figure 2.27(a).

• An alternative approach exploits the fact that client-server communication is structured as a
request from the client to the server followed by a reply from the server to the client.

• In this method, the client is blocked after sending a message.

• The server’s kernel does not send back an acknowledgement.

• Instead, the reply itself acts as the acknowledgement.

• Thus the sender remains blocked until the reply comes in.

• If it takes too long, the sending kernel can resend the request to guard against the possibility
of a lost message.

• This approach is shown in Figure 2.27(b).

• Although the reply functions as an acknowledgement for the request, there is no
acknowledgement for the reply.

4.6 Implementation of Client Server Model 


Figure 2.28: Common packet message sequences (exchanges of REQ, ACK and Reply packets between client and server)


• All networks have a maximum packet size.
• If the message is large, it is split into multiple packets and each packet is transmitted
individually.

• What happens if the packets are lost or garbled, or arrive at the server out of sequence?

• The solution is to assign a sequence number to each packet according to the order of the
packets.

• In case of multiple packets per message, should the receiver acknowledge each packet or
just the last packet?

• If an acknowledgement is sent for each packet, then only lost packets are retransmitted, but
the message traffic increases.

• The alternative is to send a single acknowledgement for all the packets.

• This results in transmission of fewer acknowledgements.

• However, the client may not know which packet was lost and hence may end up
retransmitting all packets.

• The choice of method depends on the loss rate of the network.

• The following types of packets are transmitted across the network:
o REQ: The request packet is used to send the request from the client to the server.
o REPLY: This packet carries the result from the server to the client.
o ACK: The acknowledgement packet confirms correct receipt of a packet to the sender.
o AYA (Are You Alive?): This packet is sent in case the server takes a long time to complete
the client’s request.
o IAA (I Am Alive): The server, if active, replies with this packet.
o It implies that the client kernel should wait for the reply.
o If no response is received for an AYA message, the client resends the message.
o Figure 2.28 shows some of the common packet message sequences.
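Splitting a message into numbered packets and reassembling it at the receiver can be sketched as follows. The packet layout (sequence number, total count, payload) and both function names are assumptions made for this example; a real protocol would also carry a message identifier and a checksum to detect garbled packets.

```python
def split_into_packets(message, max_payload):
    """Return (sequence_number, total_packets, payload) tuples for a message."""
    chunks = [message[i:i + max_payload]
              for i in range(0, len(message), max_payload)]
    return [(seq, len(chunks), chunk) for seq, chunk in enumerate(chunks)]

def reassemble(packets):
    """Rebuild the message; if packets are missing, report their numbers."""
    received = {}
    total = None
    for seq, count, payload in packets:
        received[seq] = payload          # a duplicate simply overwrites itself
        total = count
    if total is None:
        return None, []                  # nothing has arrived yet
    missing = [seq for seq in range(total) if seq not in received]
    if missing:
        return None, missing             # only these need to be retransmitted
    return b"".join(received[seq] for seq in range(total)), []

packets = split_into_packets(b"abcdefghij", 4)
in_reverse = reassemble(list(reversed(packets)))     # out-of-sequence arrival
with_loss = reassemble([packets[0], packets[2]])     # packet 1 was lost
```

Reporting the missing sequence numbers corresponds to selective retransmission; with a single acknowledgement for the whole message the sender would have to resend every packet.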


5. Remote procedure call and implementation issues 


5.1 The RPC Model 
Figure 2.29: RPC model (the client calls the remote procedure and waits for the result; the request goes to the server, which calls the local procedure and returns the results in the reply; the client then returns from the call)


• The RPC model is similar to the ordinary procedure call model, which transfers control and
data within a program in the following manner:


o To make a procedure call, the caller places the arguments to the procedure in some
well-specified location.
o Control is then transferred to the sequence of instructions that constitutes the body of
the procedure.
o The procedure body is executed in a newly created execution environment that includes
copies of the arguments given in the calling instruction.
o After the procedure’s execution is over, control returns to the calling point, possibly
returning a result.
• In RPC, since the caller and the callee processes have disjoint address spaces, the remote
procedure has no access to data and variables of the caller’s environment.
• The caller sends a call (request) message to the callee and waits for a reply message; the
request message contains the remote procedure’s parameters, among other things.
• The server process executes the procedure and then returns the result of procedure
execution in a reply message to the client process.
• Once the reply message is received, the result of procedure execution is extracted
and the caller’s execution is resumed.


5.2 Implementing RPC Mechanism 


Figure 2.30: Implementation of RPC mechanism (client machine: client, client stub, RPCRuntime; server machine: server, server stub, RPCRuntime; the client calls and returns through its stub, the stubs pack and unpack messages, and the RPCRuntime sends and receives them)
• To achieve the goal of semantic transparency, the implementation of an RPC mechanism is
based on the concept of stubs, which provide a perfectly normal (local) procedure call
abstraction by concealing from programs the interface to the underlying RPC system.

• To conceal the interface of the underlying RPC system from both the client and server
processes, a separate stub procedure is associated with each of the two processes.

• To hide the existence and functional details of the underlying network, an RPC
communication package (known as RPCRuntime) is used on both the client and server sides.

• Implementation of an RPC mechanism usually involves the following five elements of
program:
1. The client
2. The client stub
3. The RPCRuntime
4. The server stub
5. The server

• The interaction between them is shown in Figure 2.30.
• The client, the client stub, and one instance of RPCRuntime execute on the client machine,
while the server, the server stub, and another instance of RPCRuntime execute on the server
machine.
• The job of each of these elements is described below.
(a) Client:

o The client is a user process that initiates a remote procedure call.

o To make a remote procedure call, the client makes a perfectly normal local call that
invokes a corresponding procedure in the client stub.

(b) Client Stub:

o The client stub is responsible for carrying out the following two tasks:

o On receipt of a call request from the client, it packs a specification of the target procedure
and the arguments into a message and then asks the local RPCRuntime to send it to the
server stub.

o On receipt of the result of procedure execution, it unpacks the result and passes it to the
client.

(c) RPCRuntime:

o The RPCRuntime handles transmission of messages across the network between client
and server machines.

o It is responsible for retransmissions, acknowledgements, packet routing, and
encryption.

o The RPCRuntime on the client machine receives the call request message from the client
stub and sends it to the server machine.

o It also receives the message containing the result of procedure execution from the server
machine and passes it to the client stub.

o On the other hand, the RPCRuntime on the server machine receives the message
containing the result of procedure execution from the server stub and sends it to the
client machine.

o It also receives the call request message from the client machine and passes it to the
server stub.

(d) Server Stub:

o The job of the server stub is very similar to that of the client stub.

o It performs the following two tasks:

o On receipt of the call request message from the local RPCRuntime, the server stub
unpacks it and makes a perfectly normal call to invoke the appropriate procedure in the
server.

o On receipt of the result of procedure execution from the server, the server stub packs
the result into a message and then asks the local RPCRuntime to send it to the client stub.

(e) Server:

o On receiving a call request from the server stub, the server executes the appropriate
procedure and returns the result of procedure execution to the server stub.
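The division of work among the five elements can be sketched in a single process. This is a toy illustration, not a real RPC system: the procedure `add`, the stub `remote_add`, the `PROCEDURES` table and the JSON message format are all assumptions, and `rpc_runtime_send` calls the server stub directly where a real RPCRuntime would transmit the message over the network.

```python
import json

# --- server side -------------------------------------------------------
def add(a, b):
    """The server: the procedure that actually does the work."""
    return a + b

PROCEDURES = {"add": add}

def server_stub(request_bytes):
    """Unpack the call message, invoke the procedure, pack the result."""
    call = json.loads(request_bytes.decode("utf-8"))
    result = PROCEDURES[call["proc"]](*call["args"])
    return json.dumps({"result": result}).encode("utf-8")

# --- RPCRuntime (stand-in) ---------------------------------------------
def rpc_runtime_send(request_bytes):
    """Deliver the call message and hand back the reply message."""
    return server_stub(request_bytes)

# --- client side -------------------------------------------------------
def remote_add(a, b):
    """Client stub: looks like an ordinary local procedure to the client."""
    request = json.dumps({"proc": "add", "args": [a, b]}).encode("utf-8")
    reply = rpc_runtime_send(request)
    return json.loads(reply.decode("utf-8"))["result"]

total = remote_add(2, 3)         # the client makes a perfectly normal call
```

Because the arguments are copied into the message, this sketch uses call-by-value parameter passing.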


5.3 Parameter Passing Semantics 
(a) Call by Value
o In the call-by-value method, all parameters are copied into a message that is transmitted from
the client to the server through the intervening network.
o This poses no problems for simple compact types such as integers, counters, small arrays,
and so on.
o Passing larger data types such as multidimensional arrays, trees and so on can consume
much time for transmission of data that may not be used.
o Thus this method is not suitable for passing parameters involving voluminous data.
(b) Call by Reference
o Most RPC mechanisms use call-by-value semantics for parameter passing because the
client and the server exist in different address spaces, possibly even on different types of
machines, so that passing pointers or passing parameters by reference is meaningless.
o A few RPC mechanisms do allow passing of parameters by reference, in which pointers to
the parameters are passed from the client to the server.
o These are usually closed systems, where a single address space is shared by all processes
in the system.
o In an object-based system that uses the RPC mechanism for object invocation, the
call-by-reference semantics is known as call by object reference.


5.4 Call Semantics 
• The different types of call semantics used in RPC systems are described below.
(a) Possibly or May-Be Call Semantics
o This is the weakest semantics and is not really appropriate to RPC, but it is mentioned here
for completeness.
o In this method, to prevent the caller from waiting indefinitely for a response from the
callee, a timeout mechanism is used.
o The caller waits for a predetermined timeout period and then continues with its
execution.
o Therefore this semantics does not guarantee anything about the receipt of the call
message or the execution of the procedure by the callee.


(b) Last One Call Semantics
o It uses the idea of retransmitting the call message based on timeouts until a response is
received by the caller.
o The calling of the remote procedure by the caller, the execution of the procedure by the
callee, and the return of the result to the caller are repeated until the result
of procedure execution is received by the caller.
o Last one semantics can be easily achieved in the way described above when only two
processors are involved in the RPC.
o Suppose process P1 of node N1 calls procedure F1 on node N2, which in turn calls
procedure F2 on node N3.
o While the process on N3 is working on F2, node N1 crashes.
o Node N1’s processes will be restarted and P1’s call to F1 will be repeated.
o The second invocation of F1 will again call procedure F2 on N3.
o N3 is totally unaware of node N1’s crash.
o Procedure F2 will be executed twice on node N3, and N3 may return the results of the two
executions of F2 in any order, possibly violating last one semantics.
o The basic difficulty in achieving last one semantics in such cases is caused by orphan calls.
o An orphan call is one whose parent has expired due to a node crash.
o To achieve last one semantics, these orphan calls must be terminated before restarting
the crashed processes.
o This is normally done either by waiting for them to finish or by tracking them down and
killing them.


(c) Last-of-Many Call Semantics
o A simple way to neglect orphan calls is to use call identifiers to uniquely identify each
call.
o When a call is repeated, it is assigned a new call identifier.
o Each response message has the corresponding call identifier associated with it.
o A caller accepts a response only if the call identifier associated with it matches the
identifier of the most recently repeated call; otherwise it ignores the response message.
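
The call-identifier rule for last-of-many semantics can be sketched in a few lines of Python. This is only an illustration of the bookkeeping on the caller side, not code from any RPC library; the names LastOfManyCaller, send_call and accept are hypothetical.

```python
import itertools


class LastOfManyCaller:
    """Caller-side bookkeeping for last-of-many semantics (illustrative only)."""

    def __init__(self):
        self._ids = itertools.count(1)   # source of fresh call identifiers
        self.current_id = None           # identifier of the most recent (re)transmission

    def send_call(self):
        """Start or repeat a call; every repetition gets a new identifier."""
        self.current_id = next(self._ids)
        return self.current_id

    def accept(self, response_id):
        """Accept a response only if it answers the most recently repeated call."""
        return response_id == self.current_id


caller = LastOfManyCaller()
first = caller.send_call()       # original call
second = caller.send_call()      # same call repeated after a timeout or restart
print(caller.accept(first))      # response belonging to the older (orphan) call
print(caller.accept(second))     # response belonging to the latest call
```

A response carrying the identifier of an earlier repetition is simply dropped, so results of orphan executions never reach the application.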


(d) At-Least-Once Call Semantics
o This semantics is weaker than last-of-many semantics.
o It just guarantees that the call is executed one or more times but does not specify which
results are returned to the caller.
o It can be implemented simply by using timeout-based retransmission without caring about
orphan calls.
o For nested calls, if there are any orphan calls, it takes the result of the first response
message and ignores the others, whether or not the accepted response is from an
orphan.
(e) Exactly-Once Call Semantics
o This is the strongest and the most desirable call semantics because it eliminates the
possibility of a procedure being executed more than once, no matter how many times a
call is retransmitted.
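
The text does not say how duplicate executions are suppressed. One common approach, shown here purely as an assumption, is for the server to remember the reply for each call identifier and to replay it when the same call arrives again. The sketch assumes that a retransmission reuses the original identifier and that the server does not crash in between; DedupServer and its members are hypothetical names.

```python
class DedupServer:
    """Runs the procedure once per call identifier and replays cached replies."""

    def __init__(self, procedure):
        self.procedure = procedure
        self.replies = {}        # call identifier -> result already computed
        self.executions = 0      # how many times the procedure really ran

    def handle(self, call_id, argument):
        if call_id not in self.replies:
            # First arrival of this call: execute and remember the result.
            self.executions += 1
            self.replies[call_id] = self.procedure(argument)
        # Retransmissions fall through to here and get the cached reply.
        return self.replies[call_id]


server = DedupServer(lambda x: x * 2)
print(server.handle(call_id=7, argument=21))   # first arrival
print(server.handle(call_id=7, argument=21))   # retransmission of the same call
print(server.executions)                       # number of real executions
```

Surviving a server crash would additionally require the reply table to be kept on stable storage, which this sketch does not attempt.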


6. Implementation Issues in RPC 
(a) RPC protocols 

o The first issue is the choice of the RPC protocol.

o Connection-oriented protocol
▪ The client is bound to the server and a connection is established between them.
▪ The advantage is that communication becomes much easier.
▪ The disadvantage, especially over a LAN, is the performance loss.

o Connectionless protocol
▪ IP and UDP are easy to use and fit in well with existing UNIX systems and networks
such as the Internet, but the downside is performance.

o Packet and message length is another issue.

o Doing an RPC has a large, fixed overhead, independent of the amount of data sent.

o Thus reading a 64K file in a single 64K RPC is vastly more efficient than reading it in 64 1K
RPCs.

o It is therefore important that the protocol and network allow large transmissions.

o Some RPC systems are limited to small sizes.

o In addition, many networks cannot handle large packets, so a single RPC will have to be
split over multiple packets, causing extra overhead.

(b) Acknowledgement 

o If RPCs are broken up into many small packets, a new issue arises: Should individual 
packets be acknowledged or not? 

o Suppose, for example, that a client wants to write a 4K block of data to a file server, but 
the system cannot handle packets larger than 1K. 


Figure 2.31: Acknowledgement to RPC messages. (a) A 4K message split into packets 0 to 3. (b) Stop-and-wait: the server acknowledges each packet separately (ACK 0 to ACK 3). (c) Blast: the server sends one acknowledgement (ACK 0-3) for the whole message.

o One strategy, known as a stop-and-wait protocol, is for the client to send packet 0 with
the first 1K, then wait for an acknowledgement from the server, as illustrated in Figure
2.31(b).

o Then the client sends the second 1K, waits for another acknowledgement, and so on. 

o The alternative, often called a blast protocol, is simply for the client to send all the 
packets as fast as it can. 

o With this method, the server acknowledges the entire message when all the packets have 
been received, not one by one. 

o The blast protocol is illustrated in Figure 2.31(c). 
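
The difference between the two strategies can be made concrete by counting messages for the 4K example. The following Python sketch is a simple model under the assumption that no packets are lost; the function names and the dictionary keys are made up for illustration.

```python
import math


def packets_needed(message_bytes, max_packet_bytes):
    """Number of packets required to carry the whole message."""
    return math.ceil(message_bytes / max_packet_bytes)


def stop_and_wait(message_bytes, max_packet_bytes):
    """Every packet is acknowledged before the next one is sent."""
    n = packets_needed(message_bytes, max_packet_bytes)
    return {"data_packets": n, "acks": n, "client_waits": n}


def blast(message_bytes, max_packet_bytes):
    """All packets are sent back to back; one ACK covers the whole message."""
    n = packets_needed(message_bytes, max_packet_bytes)
    return {"data_packets": n, "acks": 1, "client_waits": 1}


print(stop_and_wait(4096, 1024))   # 4K write with a 1K packet limit
print(blast(4096, 1024))
```

Both protocols send the same number of data packets; the blast protocol saves acknowledgements and the pauses in which the client waits for them.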

(c) Critical path 

o The sequence of instructions that is executed on every RPC is called the critical path, and
is shown in Figure 2.32.

Client machine:
  Client: call stub procedure.
  Client stub: prepare message buffer; marshal parameters into buffer; fill in message header fields; trap to kernel.
  Kernel: context switch to kernel; copy message to kernel; determine destination address; put address in message header; set up network interface; start timer.
Server machine:
  Kernel: process interrupt; check packet for validity; decide which stub to give it to; see if stub is waiting; copy message to server stub; context switch to server stub.
  Server stub: unmarshal parameters; set up parameters on stack; call server.
  Server: perform service.

Figure 2.32: Critical path from client to server.



o Even for the null RPC, the dominant costs are the context switch to the server stub when 
a packet arrives, the interrupt service routine, and moving the packet to the network 
interface for transmission. 

o For the 1440-byte RPC, the picture changes considerably, with the Ethernet transmission
time now being the largest single component, with the time for moving the packet into
and out of the interface coming in close behind.

(d) Copying 

o An issue that frequently dominates RPC execution times is copying.

o The number of times a message must be copied varies from 1 to about 8, depending on
the hardware, software, and type of call.

o In the best case, the network chip can DMA the message directly out of the client stub's
address space onto the network (copy 1), depositing it in the server kernel's memory in
real time.

o In the worst case, the kernel copies the message from the client stub into a kernel buffer
for subsequent transmission, either because it is not convenient to transmit directly from
user space or because the network is currently busy (copy 1).

o Later, the kernel copies the message, in software, to a hardware buffer on the network
interface board (copy 2).

o At this point, the hardware is started, causing the packet to be moved over the network
to the interface board on the destination machine (copy 3).

o When the packet-arrived interrupt occurs on the server's machine, the kernel copies it
to a kernel buffer (copy 4).

o Finally, the message has to be copied to the server stub (copy 5).

o Suppose that it takes an average of 500 nsec to copy a 32-bit word; then with eight
copies, each word needs 4 microsec, no matter how fast the network itself is.
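
The arithmetic behind this bound can be checked with a short Python sketch. The 500 nsec copy time and the 32-bit word come from the text; the function names and the derived throughput ceiling are added here for illustration.

```python
COPY_TIME_NS = 500   # time to copy one 32-bit word, as assumed in the text
WORD_BYTES = 4       # a 32-bit word


def per_word_overhead_ns(copies, copy_time_ns=COPY_TIME_NS):
    """Total copying time spent on one word that is copied `copies` times."""
    return copies * copy_time_ns


def max_throughput_bytes_per_sec(copies, copy_time_ns=COPY_TIME_NS):
    """Upper bound on the data rate if copying were the only cost."""
    return WORD_BYTES * 1_000_000_000 / per_word_overhead_ns(copies, copy_time_ns)


print(per_word_overhead_ns(8))            # nanoseconds per word with eight copies
print(max_throughput_bytes_per_sec(8))    # bytes per second with eight copies
print(max_throughput_bytes_per_sec(1))    # bytes per second with a single copy
```

With eight copies the copying alone limits the data rate to about 1 Mbyte per second, independent of the network speed.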

(e) Timer management 

o A large amount of machine time is wasted in managing the timers.

o Setting a timer requires building a data structure specifying when the timer is to expire
and what is to be done after the timer expires.

o The data structure is then inserted into a list consisting of the other pending timers.

o Usually, the list is kept sorted on time, with the next timeout at the head of the list and
the most distant one at the end, as shown in Figure 2.33.

o When an acknowledgement or reply arrives before the timer expires, the timeout entry
must be located and removed from the list.

o In practice, very few timers actually expire, so most of the work of entering and removing
a timer from the list is wasted effort.


Figure 2.33: (a) Timeouts in a sorted list. (b) Timeouts in the process table.
o Another way is to maintain a process table, with one entry containing all the information
about each process in the system.
o While an RPC is being carried out, the kernel has a pointer to the current process table
entry in a local variable.
o Instead of storing timeouts in a sorted linked list, each process table entry has a field for
holding its timeout, if any, as shown in Figure 2.33(b).

o Setting a timer for an RPC now consists of adding the length of the timeout to the current
time and storing the result in the process table.

o Turning a timer off consists of merely storing a zero in the timer field.

o Thus the actions of setting and clearing timers are now reduced to a few machine
instructions each.
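
A minimal Python sketch of the process-table scheme follows. The class and method names are hypothetical, and the periodic scan in expired() is an assumption about how timeouts would eventually be detected; the text only describes setting and clearing.

```python
class ProcessTableTimers:
    """One timeout field per process table entry; 0 means no timer is set."""

    def __init__(self, n_processes):
        self.timeout = [0] * n_processes

    def set_timer(self, pid, now, length):
        # Setting a timer: current time plus timeout length, stored in place.
        self.timeout[pid] = now + length

    def clear_timer(self, pid):
        # Turning a timer off: store a zero in the timer field.
        self.timeout[pid] = 0

    def expired(self, now):
        # Assumed periodic sweep over the table to find timers that ran out.
        return [pid for pid, t in enumerate(self.timeout) if t != 0 and t <= now]


timers = ProcessTableTimers(4)
timers.set_timer(pid=3, now=100, length=20)   # expires at time 120
timers.set_timer(pid=1, now=100, length=5)    # expires at time 105
timers.clear_timer(1)                         # reply arrived in time
print(timers.expired(now=130))
```

Setting and clearing touch a single field, in contrast to inserting into and removing from a sorted list; the price is the occasional sweep over the whole table.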


7 Case Studies : Sun RPC 


7.1 Stub Generation 
• Sun RPC uses the automatic stub generation approach, although users have the flexibility of
writing the stubs manually.
• An application's interface definition is written in an IDL called RPC Language (RPCL).
• RPCL is an extension of the Sun XDR language that was originally designed for specifying
external data representations.
• An interface definition contains:
o A program number (which is 0x20000000 in our example).
o A version number of the service (which is 1 in our example).
o The procedures supported by the service (in our example READ and WRITE).
o The input and output parameters along with their types for each procedure.
• The three values program number (STATELESS_FS_PROG), version number (STATELESS_FS_VERS),
and procedure number (READ or WRITE) uniquely identify a remote procedure.
• The READ and WRITE procedures are given numbers 1 and 2, respectively.
• Interface definition file names have an extension .x (for example, StatelessFS.x).
/* Interface definition for a stateless file service (StatelessFS) in file StatelessFS.x */
const FILE_NAME_SIZE = 16;
const BUFFER_SIZE = 1024;
typedef string FileName<FILE_NAME_SIZE>;
typedef long Position;
typedef long Nbytes;

struct Data {
    long n;
    char buffer[BUFFER_SIZE];
};

struct readargs {
    FileName filename;
    Position position;
    Nbytes n;
};

struct writeargs {
    FileName filename;
    Position position;
    Data data;
};

program STATELESS_FS_PROG {
    version STATELESS_FS_VERS {
        Data READ (readargs) = 1;
        Nbytes WRITE (writeargs) = 2;
    } = 1;
} = 0x20000000;

Figure 2.34: The steps in creating an RPC application in Sun RPC


• The IDL compiler is called rpcgen in Sun RPC. From an interface definition file, rpcgen
generates the following:
1. A header file that contains definitions of common constants and types defined in the
interface definition file.
▪ It also contains external declarations for all XDR marshalling and unmarshalling
procedures that are automatically generated.
▪ The name of the header file is formed by taking the base name of the input file to
rpcgen and adding a .h suffix (for example, StatelessFS.h).
▪ This file is manually included in the client and server program files and automatically
included in the client stub, server stub, and XDR filters files using #include.
2. An XDR filters file that contains XDR marshalling and unmarshalling procedures.
▪ These procedures are used by the client and server stub procedures.
▪ The name of this file is formed by taking the base name of the input file to rpcgen
and adding a _xdr.c suffix (for example, StatelessFS_xdr.c).
3. A client stub file that contains one stub procedure for each procedure defined in the
interface definition file.
▪ A client stub procedure name is the name of the procedure given in the interface
definition, converted to lowercase and with an underscore and the version number
appended.
▪ For instance, in our example, the client stub procedure names for the READ and WRITE
procedures will be read_1 and write_1, respectively.
▪ The name of the client stub file is formed by taking the base name of the input file
to rpcgen and adding a _clnt.c suffix (for example, StatelessFS_clnt.c).
4. A server stub file that contains the main routine, the dispatch routine, and one stub
procedure for each procedure defined in the interface definition file, plus a null procedure.
▪ The main routine creates the transport handles and registers the service.
▪ The default is to register the program on both the UDP and TCP transports. However, a user
can select which transport to use with a command-line option to rpcgen.
▪ The dispatch routine dispatches incoming remote procedure calls to the appropriate
procedure.
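
The naming rules above are mechanical, so they can be summarised in a small Python helper. This is not part of rpcgen; the function names are hypothetical, and the _svc.c suffix for the server stub file is not stated in the text but is the suffix rpcgen conventionally uses.

```python
import os


def rpcgen_outputs(interface_file):
    """File names derived from the base name of the interface definition file."""
    base, _ext = os.path.splitext(os.path.basename(interface_file))
    return {
        "header": base + ".h",
        "xdr_filters": base + "_xdr.c",
        "client_stub": base + "_clnt.c",
        "server_stub": base + "_svc.c",   # suffix assumed, not given in the text
    }


def client_stub_name(procedure, version):
    """Procedure name in lowercase, then an underscore and the version number."""
    return f"{procedure.lower()}_{version}"


print(rpcgen_outputs("StatelessFS.x"))
print(client_stub_name("READ", 1), client_stub_name("WRITE", 1))
```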
• Using the files generated by rpcgen, an RPC application is created in the following
manner:
1. The application programmer manually writes the client program and server program for
the application.
2. The client program file is compiled to get a client object file.
3. The server program file is compiled to get a server object file.
4. The client stub file and the XDR filters file are compiled to get a client stub object file.
5. The server stub file and the XDR filters file are compiled to get a server stub object file.
6. The client object file, the client stub object file, and the client-side RPCRuntime library
are linked together to get the client executable file.
7. The server object file, the server stub object file, and the server-side RPCRuntime library
are linked together to get the server executable file.



7.2 Procedure Arguments 

e In Sun RPC, a remote procedure can accept only one argument and return only one result. 

e Therefore, procedures requiring multiple parameters as input or as output must include 
them as components of a single structure. 

• If a remote procedure does not take an argument, a NULL pointer must still be passed as an
argument to the remote procedure.

• Therefore, a Sun RPC call always has two arguments: the first is a pointer to the single
argument of the remote procedure and the second is a pointer to a client handle.
• On the other hand, a return argument of a procedure is a pointer to the single result.


7.3. Marshalling Arguments and Results 

e Sun RPC allows arbitrary data structures to be passed as arguments and results. 

e Since significant data representation differences can exist between the client computer and 
the server computer, these data structures are converted to External Data Representation 
(XDR) and back using marshalling procedures. 

• The marshalling procedures to be used are specified by the user and may be either built-in
procedures supplied in the RPCRuntime library or user-defined procedures defined in terms
of the built-in procedures.


7.4 Call Semantics 

e Sun RPC supports at-least-once semantics. 

e After sending a request message, the RPC Runtime library waits for a timeout period for the 
server to reply before retransmitting the request. 

e The number of retries is the total time to wait divided by the timeout period. 

e The total time to wait and the timeout period have default values of 25 and 5 seconds, 
respectively. 

e These default values can be set to different values by the users. 

e Eventually, if no reply is received from the server within the total time to wait, the RPC 
Runtime library returns a timeout error. 
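
The retransmission rule can be sketched as a loop in Python. This is a simplified model, not the Sun RPC runtime: send_request stands for one transmission that returns a reply or None after waiting for the timeout period, and both call_with_retries and make_flaky_server are hypothetical helpers. Whether the first transmission counts as one of the retries is an assumption made here.

```python
def call_with_retries(send_request, total_wait=25.0, timeout=5.0):
    """Retransmit until a reply arrives or the total time to wait is used up."""
    attempts = int(total_wait // timeout)   # 25 / 5 = 5 with the default values
    for _ in range(attempts):
        reply = send_request(timeout)       # waits up to `timeout` for a reply
        if reply is not None:
            return reply
    raise TimeoutError("no reply within the total time to wait")


def make_flaky_server(failures):
    """Test double: drops the first `failures` requests, then answers."""
    state = {"calls": 0}

    def send_request(timeout):
        state["calls"] += 1
        return "reply" if state["calls"] > failures else None

    return send_request, state


send, state = make_flaky_server(failures=2)
print(call_with_retries(send), state["calls"])
```

Because every retransmission may be executed by the server, the remote procedure can run more than once, which is exactly what at-least-once semantics permits.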


7.5 Client-Server Binding 

e Sun RPC does not have a network wide binding service for client-server binding. 

e Instead, each node has a local binding agent called portmapper that maintains a database of 
mapping of all local services and their port numbers. 

• The portmapper runs at a well-known port number on every node.

e When a server starts up, it registers its program number, version number, and port number 
with the local portmapper. 

e When a client wants to do an RPC, it must first find out the port number of the server that 
supports the remote procedure. 

e For this, the client makes a remote request to the portmapper at the server's host, specifying 
the program number and version number. 

• A client must specify the host name of the server when it imports a service interface.

e The procedure clnt_create is used by a client to import a service interface. 

e It returns a client handle that contains the necessary information for communicating with 
the corresponding server port, such as the socket descriptor and socket address. 

e The client handle is used by the client to directly communicate with the server when making 
subsequent RPCs to procedures of the service interface. 
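
The binding steps can be modelled with a dictionary per node. The Python sketch below only mirrors the lookup logic; it does not use the real portmapper protocol or clnt_create, and the names Portmapper, import_service, the host name and the port number 40000 are invented for the example.

```python
class Portmapper:
    """Per-node table mapping (program number, version number) to a port."""

    def __init__(self):
        self._ports = {}

    def register(self, program, version, port):
        # Done by a server when it starts up.
        self._ports[(program, version)] = port

    def lookup(self, program, version):
        # Done on behalf of a client that wants to bind to the service.
        return self._ports.get((program, version))


def import_service(portmappers, host, program, version):
    """Ask the portmapper on `host` for the port, then build a client handle."""
    port = portmappers[host].lookup(program, version)
    if port is None:
        raise LookupError("service not registered on " + host)
    return {"host": host, "port": port}   # stand-in for the real client handle


portmappers = {"fileserver": Portmapper()}
portmappers["fileserver"].register(0x20000000, 1, 40000)
handle = import_service(portmappers, "fileserver", 0x20000000, 1)
print(handle)
```

The client has to name the host itself, because there is no network-wide service that could find a suitable server for it.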


7.6 Exception Handling 


• The RPCRuntime library of Sun RPC has several procedures for processing detected errors.
• The server-side error-handling procedures typically send a reply message back to the client
side, indicating the detected error.

e The client-side error-handling procedures provide the flexibility to choose the error- 
reporting mechanism. 

e Errors may be reported to users either by printing error messages to stderr or by returning 
strings containing error messages to clients. 


7.7 Security

• No authentication.
o This is the default type.
o In this case, no attempt is made by the server to check a client's authenticity before
executing the requested procedure.
o Consequently, clients do not pass any authentication parameters in request messages.

• UNIX-style authentication.
o This style is used to restrict access to a service to a certain set of users.
o In this case, the uid and gid of the user running the client program are passed in every
request message, and based on this authentication information, the server decides whether
to execute the requested procedure or not.

• DES-style authentication.
o In DES-style authentication, each user has a globally unique name called a netname.
o The netname of the user running the client program is passed in encrypted form in every
request message.
o On the server side, the encrypted netname is first decrypted, and then the server uses the
information in the netname to decide whether to execute the requested procedure or not.


8 Case Studies : DCE RPC


• DCE RPC also uses the automatic stub generation approach.

• An application's interface definition is written in IDL.

• Unlike Sun RPC, DCE RPC IDL allows a completely general specification of procedure
arguments and results.

• As shown in the figure, each interface is uniquely identified by a universally unique identifier
(UUID) that is a 128-bit binary number represented in the IDL file as an ASCII string in
hexadecimal.

• The uniqueness of each UUID is ensured by incorporating in it the timestamp and the
location of creation.

• A UUID as well as a template for the interface definition is produced by using the uuidgen
program.

• Therefore, to create the interface definition file for a service, the first step is to call the
uuidgen program.

• The automatically generated template file is then manually edited to define the constants
and the procedure interfaces of the service.

• When the IDL file is complete, it is compiled using the IDL compiler to generate the client
and server stubs and a header file.

[uuid (b20a1705-3c26-12d8-8ea3-04163a0dcef2),
version (1.0)]
interface stateless_fs
{
    const long FILE_NAME_SIZE = 16;
    const long BUFFER_SIZE = 1024;
    typedef char FileName[FILE_NAME_SIZE];
    typedef char Buffer[BUFFER_SIZE];

    void read (
        [in] FileName filename,
        [in] long position,
        [in, out] long *nbytes,
        [out] Buffer buffer
    );

    void write (
        [in] FileName filename,
        [in] long position,
        [in, out] long *nbytes,
        [in] Buffer buffer
    );
}

• The client and server programs are written manually.

• The default call semantics of a remote procedure in DCE RPC is at-most-once semantics.

• For a procedure that is declared idempotent in its IDL definition, error recovery is done via a
simple retransmission strategy rather than the more complex protocol used to implement
at-most-once semantics.

• DCE RPC has a network-wide binding service for client-server binding, based on its directory
service.

• Every cell in a DCE system has a component called the Cell Directory Service (CDS), which
controls the naming environment used within a cell.

• Moreover, on each DCE server node runs a daemon process called rpcd (RPC daemon) that
maintains a database of (server, endpoint) entries.

• An endpoint is a process address (such as the TCP/IP port number) of a server on its machine.

• When an application server initializes, it asks the operating system for an endpoint. It then
registers this endpoint with its local rpcd.

• At the time of initialization, the server also registers its host address with the CDS of its cell.

• When a client makes its first RPC involving the server, the client stub first gets the server's
host address by interacting with the server(s) of the CDS, asking them to find a host
running an instance of the server.

• It then interacts with the rpcd of the server's host to get the endpoint of the server.

• The RPC can take place once the server's endpoint is known.

• A client can also do an RPC with a server that belongs to another cell.

• In this case, the process of getting the server's host address also involves the Global Directory
Service (GDS), which controls the global naming environment outside the cell.

1. Clock Synchronization
1.1 Physical Clocks
• Every uniprocessor system needs a timer mechanism to keep track of time for process
execution and to account for the time a process spends using resources such as the
CPU, I/O, or memory.
• In a distributed system, applications have several processes running on different machines.
• Ideally, a global clock would be used to coordinate all such processes.
• But in real systems, this is not possible.
• Each CPU has its own clock, and it is required that all clocks in the system display the same
time.


1.2 Implementing Computer Clocks 

e Nearly all computers have a circuit for keeping track of time. 

• A computer timer is usually a precisely machined quartz crystal.

e When kept under tension, quartz crystals oscillate at a well-defined frequency that depends 
on the kind of crystal, how it is cut, and the amount of tension. 

e Associated with each crystal are two registers: a counter and a holding register. 

e Holding register holds some constant value. 

• The counter register is initialized with the holding register's value.

e Each oscillation of the crystal decrements the counter by one. 

e When the counter gets to zero, an interrupt is generated and the counter is reloaded from 
the holding register. 

• It is possible to program a timer to generate an interrupt 60 times a second, or at any other
desired frequency.

e Each interrupt is called one clock tick. 

e When the system is booted initially, it usually asks the operator to enter the date and time, 
which is then converted to the number of ticks after some known starting date and stored in 
memory. 

e At every clock tick, the interrupt service procedure adds one to the time stored in memory. 

e With a single computer and a single clock, it does not matter much if this clock is off by a 
small amount. 

• As soon as multiple CPUs are introduced, each with its own clock, the situation changes.

e Although the frequency at which a crystal oscillator runs is usually fairly stable, it is impossible 
to guarantee that the crystals in different computers all run at exactly the same frequency. 

e In practice, when a system has n computers, all n crystals will run at slightly different rates, 
causing the clocks gradually to get out of sync and give different values when read out. 

e This difference in time values is called clock skew. 


1.3 Drifting of Clocks 
• Each machine is assumed to have a timer that causes an interrupt H times a second.
• When this timer goes off, the interrupt handler adds 1 to a software clock that keeps track of
the number of ticks (interrupts) since some agreed-upon time in the past.
• Let us call the value of this clock C.

• More specifically, when the UTC time is t, the value of the clock on machine p is Cp(t).

• In a perfect world, we would have Cp(t) = t for all p and all t.

• In other words, dC/dt ideally should be 1.

• Theoretically, a timer with H = 60 should generate 216,000 ticks per hour.

• In practice, the relative error obtainable with modern timer chips is about 10^-5, meaning that
a particular machine can get a value in the range 215,998 to 216,002 ticks per hour.

• If there exists some constant ρ such that

1 - ρ ≤ dC/dt ≤ 1 + ρ

the timer can be said to be working within its specification.
• The constant ρ is specified by the manufacturer and is known as the maximum drift rate.
• Slow, perfect, and fast clocks are shown in Figure 3.1.

Figure 3.1: Drifting of clocks (clock time C plotted against UTC t)

• If two clocks are drifting from UTC in opposite directions, at a time Δt after they were
synchronized, they may be as much as 2ρΔt apart.
• If the operating system designers want to guarantee that no two clocks ever differ by more
than δ, clocks must be resynchronized (in software) at least every δ/2ρ seconds.
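
These relations are easy to check numerically. The Python sketch below uses the values from the text (H = 60, ρ = 10^-5); the function names and the 1 ms bound in the last call are chosen only for illustration.

```python
def ticks_per_hour_range(h, rho):
    """Slowest and fastest tick counts per hour for a timer within specification."""
    nominal = h * 3600
    return nominal * (1 - rho), nominal * (1 + rho)


def max_skew(rho, dt):
    """Worst-case difference of two clocks drifting in opposite directions."""
    return 2 * rho * dt


def resync_interval(delta, rho):
    """Longest interval between resynchronizations that keeps the skew below delta."""
    return delta / (2 * rho)


print(ticks_per_hour_range(60, 1e-5))   # compare with 215,998 to 216,002
print(max_skew(1e-5, 100))              # seconds of skew after 100 seconds
print(resync_interval(0.001, 1e-5))     # seconds between resyncs for a 1 ms bound
```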


2. Clock Synchronization Algorithms 


Clock synchronization algorithms:
  Centralized: passive time server, active time server
  Distributed: global averaging, localized averaging

Figure 3.2: Classification of Synchronization Algorithms

Centralized Algorithms

• In a distributed system using a centralized clock synchronization algorithm, one node has a
real-time receiver (this node is called the time server node).

• The clock time of this node is taken as correct and is used as the reference time.

(a) Passive time server 
o Each node periodically sends a message called ‘time=?’ to the time server. 
o When the time server receives the message, it responds with ‘time=T’ message. 



Figure 3.3: Passive time server algorithm. The client notes the time T0 at which the request is sent and the time T1 at which the reply carrying Tserver is received; network delays are assumed to be symmetric.
o Assume that the client node has a clock time of T0 when it sends ‘time=?’ and a clock time of T1 when it
receives the ‘time=T’ message.
o T0 and T1 are measured using the same clock, thus the time needed for propagation of the
message from the time server to the client node would be (T1 - T0)/2.
o When the client node receives the reply from the time server, the client clock is readjusted to
Tserver + (T1 - T0)/2, as shown in Figure 3.3.
o Two methods have been proposed to improve the estimated value:
1. Let the approximate time taken by the time server to handle the interrupt and process
the request message ‘time=?’ be equal to I.
▪ Hence, a better estimate of the time taken for propagation of the response message
from the time server node to the client node is (T1 - T0 - I)/2.
▪ The clock is adjusted to the value Tserver + (T1 - T0 - I)/2.



Figure 3.4: Time approximation using passive time server algorithm. (T1 - T0)/2 is the estimated overhead in each direction.

2. Cristian's method.
▪ This method assumes that a certain machine, the time server, is synchronized to
UTC in some fashion; its time is called T.
▪ Periodically, all clocks are synchronized with the time server.
▪ Other machines send a message to the time server, which responds with T
as fast as possible.
▪ The interval (T1 - T0) is measured many times.



Figure 3.5: Error bounds in Cristian's algorithm. With Tmin the minimum one-way message delay, the range is T1 - T0 - 2Tmin and the accuracy of the result is ± (T1 - T0 - 2Tmin)/2.

▪ Those measurements in which (T1 - T0) exceeds a specific threshold value are
considered to be unreliable and are discarded.
▪ Only those values that fall in the range (T1 - T0 - 2Tmin) are considered for calculating
the correct time.
▪ For all the remaining measurements, an average of (T1 - T0) is calculated, and half of it is added to T.
▪ Alternatively, the measurement for which the value of (T1 - T0) is minimum is considered
most accurate, and half of its value is added to T.
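
The estimates above can be written down directly. The following Python sketch assumes that each measurement is available as a tuple (T0, T1, Tserver) taken with the client's clock; the function names and the sample values are illustrative only.

```python
def cristian_estimate(t0, t1, t_server, server_processing=0.0):
    """New client clock value: Tserver + (T1 - T0 - I) / 2, with I = server_processing."""
    return t_server + (t1 - t0 - server_processing) / 2


def best_sample(samples):
    """Pick the measurement with the smallest round trip (T1 - T0)."""
    return min(samples, key=lambda s: s[1] - s[0])


samples = [(0.0, 10.0, 50.0), (20.0, 24.0, 70.0), (40.0, 48.0, 95.0)]
t0, t1, t_server = best_sample(samples)
print(cristian_estimate(t0, t1, t_server))        # symmetric-delay estimate
print(cristian_estimate(t0, t1, t_server, 2.0))   # with 2 time units spent in the server
```

Choosing the measurement with the shortest round trip corresponds to the alternative described in the last bullet; averaging the remaining measurements would be the other option.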
(b) Active time server 
o In Cristian's algorithm, the time server is passive.
o Other machines ask it for the time periodically.
o All it does is respond to their queries.

Figure 3.6: (a) The time daemon asks all the other machines for their clock values. (b) The machines answer.
(c) The time daemon tells everyone how to adjust their clock.

o Here the time server (actually, a time daemon) is active, polling every machine
periodically to ask what time it is there.
o Based on the answers, it computes an average time and tells all the other machines to
advance their clocks to the new time or slow their clocks down until some specified
reduction has been achieved.
o In Figure 3.6(a), at 3:00, the time daemon tells the other machines its time and asks for
theirs.
o In Figure 3.6(b), they respond with how far ahead or behind the time daemon they are.
o Armed with these numbers, the time daemon computes the average and tells each
machine how to adjust its clock, as in Figure 3.6(c).
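
The averaging step of the time daemon can be sketched as follows. The Python function and the example offsets (in minutes, relative to the daemon's own clock) are assumptions for illustration; the text does not give concrete numbers.

```python
def berkeley_adjustments(offsets):
    """Given each machine's offset from the daemon, return its clock correction.

    `offsets` maps a machine name to (its clock - daemon clock); the daemon
    itself is included with offset 0.
    """
    average = sum(offsets.values()) / len(offsets)
    # Each machine must move from its own offset to the common average.
    return {name: average - offset for name, offset in offsets.items()}


# Hypothetical readings: the daemon, one machine 10 minutes behind it,
# and one machine 25 minutes ahead of it.
offsets = {"daemon": 0, "machine_a": -10, "machine_b": 25}
print(berkeley_adjustments(offsets))
```

A positive correction means the machine advances its clock; a negative one means it has to slow down until the stated amount has been absorbed.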


Distributed Algorithms 
(a) Global averaging algorithm
o One class of decentralized clock synchronization algorithms works by dividing time into
fixed-length resynchronization intervals.
o The i-th interval starts at T0 + iR and runs until T0 + (i+1)R, where T0 is an agreed-upon
moment in the past, and R is a system parameter.
o At the beginning of each interval, every machine broadcasts the current time according
to its clock.
o Because the clocks on different machines do not run at exactly the same speed, these
broadcasts will not happen precisely simultaneously.
o After a machine broadcasts its time, it starts a local timer to collect all other broadcasts
that arrive during some interval S.
o When all the broadcasts arrive, an algorithm is run to compute a new time from them.
o The simplest algorithm is just to average the values from all the other machines.
o A slight variation on this theme is first to discard the m highest and m lowest values, and
average the rest.
o Discarding the extreme values can be regarded as self-defence against up to m faulty
clocks sending out nonsense.
o Another variation is to try to correct each message by adding to it an estimate of the
propagation time from the source.
o This estimate can be made from the known topology of the network, or by timing how
long it takes for probe messages to be echoed.
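
The averaging step with the m highest and m lowest values discarded can be sketched in Python as follows; the function name and the sample clock values are illustrative.

```python
def fault_tolerant_average(times, m=0):
    """Average the received clock values after dropping the m highest and m lowest."""
    if len(times) <= 2 * m:
        raise ValueError("not enough values left after discarding extremes")
    kept = sorted(times)[m:len(times) - m]
    return sum(kept) / len(kept)


received = [100, 101, 102, 103, 999]            # one faulty clock sends nonsense
print(fault_tolerant_average(received, m=0))    # plain average, pulled off by the outlier
print(fault_tolerant_average(received, m=1))    # extremes discarded first
```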


(b) Localized averaging distributed algorithm
o The nodes of the distributed system are logically arranged in a specific pattern, such as a ring
or a grid.
o Periodically, each node exchanges its clock time with its neighbours in the logical pattern
and then sets its clock time to the average of its own clock time and the clock times of its
neighbours.


2.1 Logical Clocks 


• Lamport pointed out that clock synchronization need not be absolute.
• If two processes do not interact, it is not necessary that their clocks be synchronized because
the lack of synchronization would not be observable and thus could not cause problems.
• Furthermore, he pointed out that what usually matters is not that all processes agree on
exactly what time it is, but rather, that they agree on the order in which events occur.
• To synchronize logical clocks, Lamport defined a relation called happens-before.
• The expression a->b is read "a happens before b" and means that all processes agree that
first event a occurs, then afterward, event b occurs.
• The happens-before relation can be observed directly in two situations:
o If a and b are events in the same process, and a occurs before b, then a->b is true.
o If a is the event of a message being sent by one process, and b is the event of the
message being received by another process, then a->b is also true.
• Happens-before is a transitive relation, so if a->b and b->c, then a->c.
• If two events, x and y, happen in different processes that do not exchange messages, then
x->y is not true, but neither is y->x.
• These events are said to be concurrent, which simply means that nothing can be said about
when they happened or which is first.


Figure 3.7: Happened-before relationship and concurrent events (processes P0 to P3; in the figure, 
a, b and a, c are concurrent, c happens after b, and e happens after a, b, c, d) 


e We need a way of measuring time such that for every event, a, we can assign it a time 
value C(a) on which all processes agree. 

e Consider the three processes depicted in Fig. 3.8(a). 

e The processes run on different machines, each with its own clock, running at its own speed. 


Figure 3.8: (a) Three processes, each with its own clock. The clocks run at different rates. (b) Lamport's algorithm 
corrects the clocks. 
e As can be seen from the figure, when the clock has ticked 6 times in process 0, it has ticked 8 
times in process 1 and 10 times in process 2. 
e Each clock runs at a constant rate, but the rates are different due to differences in the crystals. 
e At time 6, process 0 sends message A to process 1. 
e The clock in process 1 reads 16 when the message arrives. 
e If the message carries the starting time, 6, in it, process 1 will conclude that it took 10 ticks to 
make the journey. 
e According to this reasoning, message B from 1 to 2 takes 16 ticks. 
e Message C from 2 to 1 leaves at 60 and arrives at 56. 
e Similarly, message D from 1 to 0 leaves at 64 and arrives at 54. 
e These values are clearly impossible, since a message cannot arrive before it was sent, so this 
situation must be prevented. 
e Lamport's solution follows from the happens-before relation: since C left at 60, it must arrive 
at 61 or later. 
e Therefore, each message carries the sending time, according to the sender's clock. 
e When a message arrives and the receiver's clock shows a value prior to the time the message 
was sent, the receiver fast forwards its clock to be one more than the sending time. 
e In Figure 3.8(b) we see that C now arrives at 61. 
e Similarly, D arrives at 70. 
e Using this method, we now have a way to assign time to all events in a distributed system 
subject to the following conditions: 
1. If a happens before b in the same process, C(a) < C(b). 
2. If a and b represent the sending and receiving of a message, C(a) < C(b). 
3. For all distinct events a and b, C(a) ≠ C(b). 
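
The clock correction rule described above can be sketched as a small Python function. The name corrected_arrival is hypothetical, and treating an arrival reading equal to the sending time as also needing correction is an assumption made here so that condition 2 always holds.

```python
def corrected_arrival(receiver_clock, send_time):
    """Return the receiver's clock value after applying Lamport's correction.

    If the receiver's clock is not yet past the sending time carried in the
    message, it is fast-forwarded to one more than the sending time.
    """
    if receiver_clock <= send_time:
        return send_time + 1
    return receiver_clock


# Message A: sent at 6, receiver reads 16, so no correction is needed.
print(corrected_arrival(16, 6))
# Message C: sent at 60, receiver reads 56, so the clock jumps to 61.
print(corrected_arrival(56, 60))
```

Because the clock is only ever moved forward, events already timestamped in the receiving process keep their order.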


3. Mutual Exclusion 

e Systems involving multiple processes are often most easily programmed using critical regions. 

e When a process has to read or update certain shared data structures, it first enters a critical 
region to achieve mutual exclusion and ensure that no other process will use the shared data 
structures at the same time. 

e In single-processor systems, critical regions are protected using semaphores, monitors, and 
similar constructs. 

e The following algorithms are used for mutual exclusion in distributed systems. 


3.1 A Centralized algorithm 

e The most straightforward way to achieve mutual exclusion in a distributed system is to 
simulate how it is done in a one-processor system. 

e One process is elected as the coordinator. 

e Whenever a process wants to enter a critical region, it sends a request message to the 
coordinator stating which critical region it wants to enter and asking for permission. 

e If no other process is currently in that critical region, the coordinator sends back a reply 
granting permission, as shown in Figure 3.9(a). 

e When the reply arrives, the requesting process enters the critical region. 

e Now suppose that another process, 2 in Figure 3.9(b), asks for permission to enter the same 
critical region. 

e The coordinator knows that a different process is already in the critical region, so it cannot 
grant permission. 

e The exact method used to deny permission is system dependent. 

e In Figure 3.9(b), the coordinator just refrains from replying, thus blocking process 2, which is 
waiting for a reply. 

e Alternatively, it could send a reply saying "permission denied." Either way, it queues the 
request from 2 for the time being. 

e When process 1 exits the critical region, it sends a message to the coordinator releasing its 
exclusive access, as shown in Figure 3.9(c). 


Figure 3.9: (a) Process 1 asks the coordinator for permission to enter a critical region. Permission is granted. 
(b) Process 2 then asks permission to enter the same critical region. The coordinator does not reply. (c) 
When process 1 exits the critical region, it tells the coordinator, which then replies to 2. 


e The coordinator takes the first item off the queue of deferred requests and sends that process 
a grant message. 

e If the process was still blocked, it unblocks and enters the critical region. 

e If an explicit message has already been sent denying permission, the process will have to poll 
for incoming traffic, or block later. 

e Either way, when it sees the grant, it can enter the critical region. 

e The centralized approach also has shortcomings. 

e The coordinator is a single point of failure, so if it crashes, the entire system may go down. 

e If processes normally block after making a request, they cannot distinguish a dead 
coordinator from "permission denied" since in both cases no message comes back. 

e In addition, in a large system, a single coordinator can become a performance bottleneck. 
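
The coordinator's bookkeeping can be illustrated with a minimal single-machine sketch. It assumes one critical region and the "no reply" style of denial; the class and method names are invented for the example, and real message passing is replaced by direct method calls.

```python
from collections import deque


class Coordinator:
    """Toy coordinator for one critical region (no real networking)."""

    def __init__(self):
        self.holder = None      # process currently in the critical region
        self.waiting = deque()  # deferred requests, in arrival order

    def request(self, pid):
        """Return "OK" if access is granted now, or None (no reply) if deferred."""
        if self.holder is None:
            self.holder = pid
            return "OK"
        self.waiting.append(pid)
        return None

    def release(self, pid):
        """Holder leaves; return the process that now gets the grant, if any."""
        if pid != self.holder:
            raise ValueError("only the current holder can release")
        self.holder = self.waiting.popleft() if self.waiting else None
        return self.holder


c = Coordinator()
print(c.request(1))  # process 1 is granted access
print(c.request(2))  # process 2 gets no reply and is queued
print(c.release(1))  # grant passes to process 2
```

The FIFO queue is what makes the scheme fair: deferred requests are granted in the order in which they arrived.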


3.2 A Distributed Algorithm / Ricart and Agrawala's algorithm 

e Ricart and Agrawala's algorithm requires that there be a total ordering of all events in the 
system. 

e That is, for any pair of events, such as messages, it must be unambiguous which one 
happened first. 

e When a process wants to enter a critical region, it builds a message containing the name of 
the critical region it wants to enter, its process number, and the current time. 

e It then sends the message to all other processes, conceptually including itself. 
e When a process receives a request message from another process, the action it takes depends 
on its state with respect to the critical region named in the message. 
e Three cases have to be distinguished: 

o If the receiver is not in the critical region and does not want to enter it, it sends back 
an OK message to the sender. 

o If the receiver is already in the critical region, it does not reply. Instead, it queues the 
request. 

o If the receiver wants to enter the critical region but has not yet done so, it compares the 
timestamp in the incoming message with the one contained in the message that it has 
sent everyone. 
= The lowest one wins. 
= If the incoming message is lower, the receiver sends back an OK message. 
= If its own message has a lower timestamp, the receiver queues the incoming request 
and sends nothing. 

e After sending out requests asking permission to enter a critical region, a process sits back 
and waits until everyone else has given permission. 

e As soon as all the permissions are in, it may enter the critical region. 

e When it exits the critical region, it sends OK messages to all processes on its queue and 
deletes them all from the queue. 
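
The three cases a receiver has to distinguish can be written as one decision function. This is a sketch, not a full implementation: the state names and the function on_request are invented for the example, and breaking timestamp ties by process number (by comparing (timestamp, pid) tuples) is an assumption made here to obtain a total order.

```python
RELEASED, WANTED, HELD = "released", "wanted", "held"


def on_request(state, own_request, incoming_request):
    """Decide how to answer an incoming request for a critical region.

    Requests are (timestamp, pid) tuples; own_request is ignored unless the
    receiver itself is in the WANTED state. Returns "OK" or "QUEUE".
    """
    if state == RELEASED:
        return "OK"      # not interested: grant immediately
    if state == HELD:
        return "QUEUE"   # inside the region: defer the reply
    # Both want the region: the lower (timestamp, pid) pair wins.
    return "OK" if incoming_request < own_request else "QUEUE"


# Process 0 requested at time 8, process 2 at time 12, process 1 is idle.
print(on_request(WANTED, (12, 2), (8, 0)))   # process 2 answers process 0
print(on_request(WANTED, (8, 0), (12, 2)))   # process 0 answers process 2
print(on_request(RELEASED, None, (8, 0)))    # process 1 answers process 0
```

Queued requests are answered with OK only when the receiver leaves the critical region.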


Figure 3.10: (a) Two processes want to enter the same critical region at the same moment. (b) Process 0 has 
the lowest timestamp, so it wins. (c) When process 0 is done, it sends an OK also, so 2 can now enter the 
critical region. 


e Suppose that two processes try to enter the same critical region simultaneously, as shown in 
Fig. 3.10(a). 

e Process 0 sends everyone a request with timestamp 8, while at the same time, process 2 
sends everyone a request with timestamp 12. 

e Process 1 is not interested in entering the critical region, so it sends OK to both senders. 

e Processes 0 and 2 both see the conflict and compare timestamps. 

e Process 2 sees that it has lost, so it grants permission to 0 by sending OK. 

e Process 0 now queues the request from 2 for later processing and enters the critical region, 
as shown in Fig. 3.10(b). 

e When it is finished, it removes the request from 2 from its queue and sends an OK message 
to process 2, allowing the latter to enter its critical region, as shown in Figure 3.10(c). 

e If any process crashes, it will fail to respond to requests. 
e This silence will be interpreted (incorrectly) as denial of permission, thus blocking all 
subsequent attempts by all processes to enter all critical regions. 

e The solution is that when a request comes in, the receiver always sends a reply, either 
granting or denying permission. 

e Whenever either a request or a reply is lost, the sender times out and keeps trying until either 
a reply comes back or the sender concludes that the destination is dead. 

e After a request is denied, the sender should block waiting for a subsequent OK message. 


3.3 Token Ring Algorithm 
e We have a bus network (e.g., Ethernet), as shown in Fig. 3.11(a), with no inherent ordering 
of the processes. 
e In software, a logical ring is constructed in which each process is assigned a position in the 
ring, as shown in Fig. 3.11(b). 
e Each process should know who is next in line after itself. 


Figure 3.11: (a) An unordered group of processes on a network. (b) A logical ring constructed in software. 
The token holder may enter the critical region or pass the token. 


e When the ring is initialized, process 0 is given a token. 

e The token circulates around the ring: it is passed from process k to process k+1 (modulo the 
ring size) in point-to-point messages. 

e When a process acquires the token from its neighbour, it checks to see if it is attempting to 
enter a critical region. 

e If so, the process enters the region, does all the work it needs to, and leaves the region. 

e After it has exited, it passes the token along the ring. 

e It is not permitted to enter a second critical region using the same token. 

e If a process is handed the token by its neighbour and is not interested in entering a critical 
region, it just passes it along. 

e As a consequence, when no processes want to enter any critical regions, the token just 
circulates at high speed around the ring. 

e Only one process has the token at any instant, so only one process can be in a critical region. 
e Since the token circulates among the processes in a well-defined order, starvation cannot 
occur. 
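
The circulation of the token can be simulated as below. This is a toy model under simple assumptions: no process crashes, the token is never lost, and each interested process enters its critical region once. The function name circulate and its parameters are illustrative.

```python
def circulate(n, wants, start=0, hops=None):
    """Pass the token around a ring of n processes and record who enters.

    wants is the set of process numbers waiting to enter a critical region.
    By default the token makes exactly one full trip around the ring.
    """
    hops = n if hops is None else hops
    pending = set(wants)
    entered = []
    holder = start
    for _ in range(hops):
        if holder in pending:          # holder uses the token, then gives it up
            entered.append(holder)
            pending.discard(holder)
        holder = (holder + 1) % n      # pass from process k to k+1 (mod n)
    return entered


print(circulate(5, {1, 3}))
print(circulate(5, {0, 4}, start=2))
```

Entry order follows ring position relative to the token, not request time, which is why a process may wait for up to n-1 hops.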


3.4 Comparison of Various Algorithms 


Algorithm    | Messages per entry/exit | Delay before entry (in message times) | Problems 
Centralized  | 3                       | 2                                     | Coordinator crash 
Distributed  | 2(n-1)                  | 2(n-1)                                | Crash of any process 
Token Ring   | 1 to ∞                  | 0 to n-1                              | Lost token, process crash 


Table 3.1: Comparison of the mutual exclusion algorithms 


4. Election Algorithm 
4.1 Bully Algorithm 
e When a process notices that the coordinator is no longer responding to requests, it initiates 
an election. A process, P, holds an election as follows: 
o P sends an ELECTION message to all processes with higher numbers. 
o If no one responds, P wins the election and becomes coordinator. 
o If one of the higher-ups answers, it takes over. P's job is done. 
e At any moment, a process can get an ELECTION message from one of its lower-numbered 
colleagues. 
e When such a message arrives, the receiver sends an OK message back to the sender to 
indicate that it is alive and will take over. 
e The receiver then holds an election, unless it is already holding one. 
e Eventually, all processes give up but one, and that one is the new coordinator. 
e It announces its victory by sending all processes a message telling them that starting 
immediately it is the new coordinator. 
e Thus the biggest guy in town always wins, hence the name "bully algorithm." 


Figure 3.12: Bully algorithm. (a) Process 4 holds an election. (b) Processes 5 and 6 respond, telling 4 to stop. 
(c) Now 5 and 6 each hold an election. (d) Process 6 tells 5 to stop. 
(e) Process 6 wins and tells everyone. 
e In Figure 3.12 we see how the bully algorithm works. 

e The group consists of eight processes, numbered from 0 to 7. 

e Previously process 7 was the coordinator, but it has just crashed. 

e Process 4 is the first one to notice this, so it sends ELECTION messages to all the processes 
higher than it, namely 5, 6, and 7, as shown in Figure 3.12(a). 

e Processes 5 and 6 both respond with OK, as shown in Figure 3.12 (b). 

e Upon getting the first of these responses, 4 knows that its job is over. 

e It knows that one of these bigwigs will take over and become coordinator. 

e It just sits back and waits to see who the winner will be. 

e In Figure 3.12 (c), both 5 and 6 hold elections, each one only sending messages to those 
processes higher than itself. 

e In Figure 3.12 (d) process 6 tells 5 that it will take over. 

e At this point 6 knows that 7 is dead and that it (6) is the winner. 

e If there is state information to be collected from disk or elsewhere to pick up where the old 
coordinator left off, 6 must now do what is needed. 

e When it is ready to take over, 6 announces this by sending a COORDINATOR message to all 
running processes. 

e When 4 gets this message, it can now continue with the operation it was trying to do when it 
discovered that 7 was dead, but using 6 as the coordinator this time. 

e In this way the failure of 7 is handled and the work can continue. 

e If process 7 is ever restarted, it will just send all the others a COORDINATOR message and 
bully them into submission. 
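
The outcome of a bully election can be sketched recursively. The model below only tracks who ends up as coordinator: message delivery, timeouts and the COORDINATOR broadcast are not simulated, and the function name bully_election is invented for the example.

```python
def bully_election(initiator, alive):
    """Return the process number that wins an election started by initiator.

    alive is the set of process numbers that are currently running.
    """
    higher = [p for p in alive if p > initiator]
    if not higher:
        return initiator  # nobody higher answered: the initiator wins
    # Every higher live process answers OK and holds its own election.
    return max(bully_election(p, alive) for p in higher)


alive = {0, 1, 2, 3, 4, 5, 6}      # process 7, the old coordinator, has crashed
print(bully_election(4, alive))
```

Whoever starts the election, the highest-numbered live process wins, which matches the example of Figure 3.12.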


4.2 A Ring Algorithm 

e Another election algorithm is based on the use of a ring, but without a token. 

e We assume that the processes are physically or logically ordered, so that each process knows 
who its successor is. 

e When any process notices that the coordinator is not functioning, it builds 
an ELECTION message containing its own process number and sends the message to its 
successor. 

e If the successor is down, the sender skips over the successor and goes to the next member 
along the ring, or the one after that, until a running process is located. 

e At each step, the sender adds its own process number to the list in the message. 

e Eventually, the message gets back to the process that started it all. 

e That process recognizes this event when it receives an incoming message containing its own 
process number. 

e At that point, the message type is changed to COORDINATOR and circulated once again, this 
time to inform everyone else who the coordinator is (the list member with the highest 
number) and who the members of the new ring are. 

e When this message has circulated once, it is removed and everyone goes back to work. 
Figure 3.13: Election algorithm using a ring. 

e In Figure 3.13 we see what happens if two processes, 2 and 5, discover simultaneously that 
the previous coordinator, process 7, has crashed. 

e Each of these builds an ELECTION message and starts circulating it. 

e Eventually, both messages will go all the way around, and both 2 and 5 will convert them 
into COORDINATOR messages, with exactly the same members and in the same order. 

e When both have gone around again, both will be removed. 

e It does no harm to have extra messages circulating; at most it wastes a little bandwidth. 
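
One trip of the ELECTION message around the ring can be simulated as follows. This is a sketch under simple assumptions: the initiator is alive, crashed processes are simply skipped, and the second (COORDINATOR) trip is not modelled. The name ring_election is illustrative.

```python
def ring_election(ring, alive, initiator):
    """Simulate one ELECTION trip; return (coordinator, list of members).

    ring lists the process numbers in ring order; alive is the set of
    running processes, which must contain the initiator.
    """
    n = len(ring)
    members = [initiator]
    i = (ring.index(initiator) + 1) % n
    while ring[i] != initiator:
        if ring[i] in alive:           # dead successors are skipped over
            members.append(ring[i])
        i = (i + 1) % n
    return max(members), members


ring = list(range(8))
alive = set(range(7))                  # process 7 has crashed
print(ring_election(ring, alive, 2))
print(ring_election(ring, alive, 5))
```

Elections started at 2 and at 5 collect the same set of members and therefore choose the same coordinator, which is why the duplicate messages do no harm.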


5. Deadlocks in Distributed System 

e Deadlocks in distributed systems are similar to deadlocks in single-processor systems. 

e They are harder to avoid, prevent, or even detect, and harder to cure when tracked down 
because all the relevant information is scattered over many machines. 

e Various strategies are used to handle deadlocks. 

e Four of the best-known ones are listed below 
1. The ostrich algorithm (ignore the problem). 
2. Avoidance (avoid deadlocks by allocating resources carefully). 
3. Prevention (statically make deadlocks structurally impossible). 
4. Detection (let deadlocks occur, detect them, and try to recover). 


5.1 Deadlock Avoidance 
e Deadlock avoidance techniques use some advance knowledge of the resource usage of 
processes to predict the future state of the system, so that allocations that could 
lead to deadlock are avoided. 
e Deadlock avoidance algorithms usually work in the following steps: 
e When a process requests a resource, even if the resource is available for allocation, it is not 
immediately allocated to the process. 

State   Max (P1 P2 P3)   Holds (P1 P2 P3)   FREE 
(1)     4  5  6          2  2  2            2 
(2)     4  5  6          4  2  2            0 
(3)     -  5  6          0  2  2            4 
(4)     -  5  6          0  5  2            1 
(5)     -  -  6          0  0  2            6 
(6)     -  5  6          0  2  6            0 
(7)     -  5  -          0  2  0            6 

Figure 3.14 (a): Safe sequence. States (1) to (5) follow the sequence (P1, P2, P3); states (1) to (3) 
followed by (6) and (7) follow the sequence (P1, P3, P2). A "-" marks a process that has completed. 

e Using advance knowledge of the resource usage of processes, the system performs some analysis 
to decide whether granting the process's request is safe or unsafe. 

e The resource is allocated to the process only when the analysis shows that it is safe to do so; 
otherwise the request is deferred. 

e Deadlock avoidance algorithms are based on the concept of safe and unsafe states. 

e A system is said to be in a safe state if it is not in a deadlock state and there exists some ordering 
of the processes in which all of them can run to completion. 

e Any ordering of the processes that can guarantee the completion of all the processes is called 
a safe sequence. 

e The concept of safe and unsafe states can be best illustrated with the help of an example. 

e Let us assume that the system has a total of 8 units of a particular resource type, for which three 
processes P1, P2 and P3 are competing. 

e For the state of figure (1) there are two safe sequences, (P1, P2, P3) and (P1, P3, P2). 

e Starting from the state of figure (1), the scheduler could simply run P1 exclusively until it asked 
for and got the two units of the resource that are currently free, leading to the state of figure 
(2). When P1 completes and releases the resources held by it, we get the state of figure (3). 

e The scheduler then chooses to run P2, eventually leading to the state of figure (4). When P2 
completes and releases the resources held by it, the system enters the state of figure (5). With 
the resources now available, P3 can run to completion. 

e The initial state of figure (1) is a safe state because the system, by careful scheduling, can avoid 
deadlock. 

e This example also shows that for a particular state there may be more than one safe 
sequence. 

e If resource allocation is not done carefully, the system may move from a safe state to an unsafe 
state. Let us consider the example shown in Figure 3.14(b). 

e Figure (8) is the same initial state as that of Figure 3.14(a). 

e This time suppose process P2 requests one additional unit of the free resource and the 
system allocates it. 


State   Max (P1 P2 P3)   Holds (P1 P2 P3)   FREE 
(8)     4  5  6          2  2  2            2 
(9)     4  5  6          2  3  2            1 


Figure 3.14 (b): Unsafe sequence 
e The resulting system state, shown in figure (9), is no longer a safe state: only one unit of the 
resource is free, while P1 and P2 each still need two more units and P3 needs four, so no process 
is guaranteed to run to completion. 
e A deadlock avoidance algorithm performs resource allocation in such a way as to ensure that 
the system will always remain in a safe state. 
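
The safe-state test for a single resource type, as used in the example above, can be sketched as follows. The function name find_safe_sequence is hypothetical; picking the runnable process with the smallest name is an arbitrary choice made here, since any runnable process would do.

```python
def find_safe_sequence(max_need, holds, free):
    """Return one safe sequence of process names, or None if the state is unsafe.

    max_need and holds map each process name to a number of resource units;
    free is the number of currently unallocated units.
    """
    remaining = {p: max_need[p] - holds[p] for p in holds}
    unfinished = set(holds)
    sequence = []
    while unfinished:
        runnable = sorted(p for p in unfinished if remaining[p] <= free)
        if not runnable:
            return None            # nobody can be guaranteed to finish
        p = runnable[0]
        free += holds[p]           # p runs to completion and releases everything
        unfinished.remove(p)
        sequence.append(p)
    return sequence


max_need = {"P1": 4, "P2": 5, "P3": 6}
print(find_safe_sequence(max_need, {"P1": 2, "P2": 2, "P3": 2}, free=2))  # state (1)
print(find_safe_sequence(max_need, {"P1": 2, "P2": 3, "P3": 2}, free=1))  # state (9)
```

An avoidance algorithm would run such a check on the state that would result from a request, and grant the request only if a safe sequence still exists.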


5.2 Deadlock Prevention. 
(a) Collective request 


o This method denies the hold-and-wait condition by ensuring that whenever a process 
requests a resource, it does not hold any other resource. 
o Also, a process must collect all its resources before it commences execution. 
o There are mainly two policies: 
1. A process must request all its resources before execution begins. 
= If all resources are available, they are allocated and the process runs to completion. 
= If any one of the requested resources is not available, none is allocated and the 
process has to wait. 
2. A process can request resources during execution, with the rule that it can request 
resources only if it does not hold any other resources. 
= If the process is holding some resources, it adheres to this rule by releasing the 
resources it holds and re-requesting the necessary resources. 
(b) Ordered request 

o This method denies the circular-wait condition: each resource type is assigned a unique 
global number to ensure a total ordering of all resource types. 

o The resource allocation policy is such that a process can request a resource at any time, 
but the process must not request a resource with a number lower than the number of 
any of the resources that it is already holding. 

o For example, if a process holds a resource type i, it may request a resource type having 
the number j only if j>i. 

o If a process needs several units of the same resource type, it may issue a single request 
for all the units. 
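
The ordered-request rule can be checked with a few lines of bookkeeping per process. The class below is a hypothetical sketch: it only validates the order of requests and does not model actual allocation or blocking.

```python
class OrderedRequester:
    """Tracks the resource-type numbers one process holds and enforces ordering."""

    def __init__(self):
        self.held = []

    def may_request(self, resource_number):
        """A type j may be requested only if j > i for every held type i."""
        return all(resource_number > h for h in self.held)

    def acquire(self, resource_number):
        if not self.may_request(resource_number):
            raise RuntimeError("request violates the global resource ordering")
        self.held.append(resource_number)


p = OrderedRequester()
p.acquire(2)
p.acquire(5)
print(p.may_request(7))  # higher than everything held
print(p.may_request(3))  # lower than held type 5
```

Since every process climbs the numbering in the same direction, no circular chain of waiting processes can form.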

(c) Preemption 

o A pre-emptable resource is a resource whose state can be saved and restored later. 

o This implies that the resource can be taken away temporarily from the process to which 
it is currently allocated without disturbing the computation performed so far. 

o There are two algorithms for pre-emption: 

1. Wait and die algorithm 
= As shown in Figure 3.15, there are two possibilities: 
= (a): an old process wants a resource held by a young process. 
= (b): a young process wants a resource held by an old process. 

Figure 3.15: Wait-die algorithm. In (a) the old process waits; in (b) the young process dies. 

= In one case we should allow the process to wait; in the other we should kill it. 

= Suppose that we label (a) dies and (b) wait. 

= Then we are killing off an old process trying to use a resource held by a young process, 
which is inefficient. 

= It is therefore better to label them the other way round, as in Figure 3.15: (a) waits and 
(b) dies. 

= Under these conditions, the arrows always point in the direction of increasing 
transaction numbers, making cycles impossible. 

= This algorithm is called wait-die. 

2. Wound and wait algorithm 

= This algorithm allows pre-emption. 

= A young whippersnapper will not pre-empt a venerable old one. 

= One transaction is supposedly wounded (it is actually killed) and the other waits. 

= If an old process wants a resource held by a young one, the old process pre-empts the 
young one, whose transaction is then killed. 

= The young one probably starts up again immediately and tries to acquire the 
resource, at which point it is forced to wait. 
Figure 3.16: Wound-wait algorithm 
= Contrast this with wait-die: there, if an old timer wants a resource held by a young one, 
the old timer waits politely. 
= If the young one wants a resource held by the old one, the young one is killed. 
= It will undoubtedly start up again and be killed again. 
= This cycle may go on many times before the old one releases the resource; wound-wait 
does not have this drawback. 
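
The two policies differ only in what happens for each age relationship, which the following sketch makes explicit. It assumes that each transaction carries a start timestamp and that a smaller timestamp means an older transaction; the function names and return strings are illustrative.

```python
def wait_die(requester_ts, holder_ts):
    """Wait-die: an older requester waits, a younger requester is killed."""
    return "wait" if requester_ts < holder_ts else "die"


def wound_wait(requester_ts, holder_ts):
    """Wound-wait: an older requester pre-empts the holder, a younger one waits."""
    return "preempt holder" if requester_ts < holder_ts else "wait"


old, young = 10, 20   # smaller timestamp = older transaction (assumption)
print(wait_die(old, young), "/", wait_die(young, old))
print(wound_wait(old, young), "/", wound_wait(young, old))
```

In both schemes a process only ever waits for one particular age group (younger holders in wait-die, older holders in wound-wait), so a cycle of waiting processes cannot arise.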


5.3 Deadlock Detection 
e When a deadlock is detected in a conventional OS, the way to resolve it is to kill off one or 
more processes. 
e Doing so invariably leads to one or more unhappy users. 
e When a deadlock is detected in a system based on atomic transactions, it is resolved by 
aborting one or more transactions. 
e The following four conditions must all hold for a deadlock to occur. 
1. Mutual exclusion condition 
= Each resource is either assigned to exactly one process or is available. 
2. Hold and wait condition 
= A process holding resources can request additional resources without releasing the 
resources it currently holds. 
3. No preemption condition 
= Previously granted resources cannot be forcibly taken away. 
4. Circular wait condition 
= There must be a circular chain of two or more processes. 
= Each is waiting for a resource held by the next member of the chain. 
e The following algorithms are used to detect deadlocks. 
(a) Centralized Deadlock Detection 
o This algorithm tries to imitate the non-distributed algorithm. 
o Each machine maintains the resource graph for its own processes and resources. 
o A central coordinator maintains the resource graph for the entire system. 
o The coordinator can be kept up to date in three ways: 
1. Whenever an arc is added to or deleted from the resource graph, a message can be sent to 
the coordinator providing the update. 
2. Periodically, every process can send a list of arcs added or deleted since the previous 
update; this requires fewer messages than the first option. 
3. The coordinator can ask for information when it needs it. 


Figure 3.17: Centralized deadlock detection. 
o When the coordinator detects a cycle, it kills off one process to break the deadlock. 
o Consider a system with processes A and B running on machine 0, and process C on 
machine 1. 
o Three resources exist: R, S and T. 
o Initially, A holds S but wants R, which it cannot have because B is using it; C has T and 
wants S, too. 
o The coordinator's view of the world is shown in Figure 3.17(c). 
o This configuration is safe: as soon as B finishes, A can get R and finish, releasing S for C. 
o After a while B releases R and asks for T, a perfectly legal and safe swap. 
o Machine 0 sends a message to the coordinator announcing the release of R. 
o Machine 1 sends a message to the coordinator announcing the fact that B is now waiting 
for its resource, T. 
o Unfortunately, the message from machine 1 arrives first, leading the coordinator to 
construct the graph of Figure 3.17(d). 
o The coordinator incorrectly concludes that a deadlock exists and kills some process. 
o Such a situation is called a false deadlock. 
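
The false deadlock can be reproduced with a small cycle check. As a simplification, the sketch below uses a process-level wait-for graph (A waits for B because B holds R, and so on) instead of the full resource graph kept by the coordinator; the function name has_cycle is invented for the example.

```python
def has_cycle(wait_for):
    """Depth-first search for a cycle; wait_for maps a process to those it waits for."""
    visiting, done = set(), set()

    def visit(p):
        if p in done:
            return False
        if p in visiting:
            return True                # reached a process already on the current path
        visiting.add(p)
        found = any(visit(q) for q in wait_for.get(p, ()))
        visiting.discard(p)
        done.add(p)
        return found

    return any(visit(p) for p in list(wait_for))


initial = {"A": {"B"}, "C": {"A"}}                 # A wants R (held by B), C wants S (held by A)
stale = {"A": {"B"}, "C": {"A"}, "B": {"C"}}       # "B waits for T" arrived before "B released R"
actual = {"C": {"A"}, "B": {"C"}}                  # both updates applied
print(has_cycle(initial), has_cycle(stale), has_cycle(actual))
```

Only the stale view contains a cycle, so any process killed on the basis of that view is killed needlessly.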


(b) Hierarchical Control 


o The hierarchical approach uses a logical tree of deadlock detectors called controllers. 

o Each controller is responsible for detecting only those deadlocks which involve the sites 
within its range in the hierarchy. 

o Each site has its own local controller which maintains its own local graph. 

o Each controller which forms a leaf of the hierarchy maintains the local wait-for graph (WFG) 
of a single site. 

o Each non-leaf controller maintains a WFG which is the union of the WFGs of its immediate 
children in the hierarchy. 

o Consider a scenario where a distributed system is spread across the entire country as 
shown in Figure 3.18(a). 

o There are four cities: Mumbai, Pune, Chennai and Bangalore, in the states Maharashtra, 
Karnataka and Tamil Nadu. 

o There are four controllers S1, S2, S3, and S4 situated at the four cities. 

Figure 3.18: Hierarchical deadlock detection. (d) Global WFG at S1 and S2. (e) Location-wise hierarchy of 
controllers. (f) Deadlock is detected at the last site with the controller at D. 
o The WFGs are then sent to the state-level controllers, and the state-level WFGs are combined to 
give a global WFG. 
o As shown in Figure 3.18(e), M, P, B and C maintain WFGs at Mumbai, Pune, Bangalore and 
Chennai. 
o S1 controls entire Maharashtra and S2 controls Karnataka and Tamil Nadu. 
o The global WFG is combined at Delhi. 
o The resources are R1 and R2 at Mumbai, R3 and R4 at Pune, R5, R6 and R7 at Bangalore, 
and R8 and R9 at Chennai. 
o Processes P1, P2, P3, P4, P5, P6, P7, P8 and P9 are requesting resources. 
o Figure 3.18(f) shows that a deadlock cycle is detected at D. 


(c) Distributed Deadlock Detection: Chandy-Misra-Haas Algorithm 

o In this algorithm processes are allowed to request multiple resources (e.g. locks) at once, 
instead of one at a time. 

o By allowing multiple requests simultaneously, the growing phase of a transaction can be 
speeded up considerably. 

o The consequence of this change to the model is that a process may now wait on two or 
more resources simultaneously. 

o Figure 3.19 shows a modified resource graph, where only the processes are shown. 

o Each arc passes through a resource, but for simplicity the resources have been omitted 
from the figure. 


Figure 3.19: Chandy-Misra-Haas algorithm (processes on Machine 0, Machine 1 and Machine 2, with probe 
messages such as (0, 8, 0) on the cross-machine arcs). 
o Process 3 on machine 1 is waiting for two resources, one held by process 4 and one held 
by process 5. 
o Some of the processes are waiting for local resources, such as process 1, but others, such 
as process 2, are waiting for resources that are located on a different machine. 

o It is these cross-machine arcs that make looking for cycles difficult. 

o The algorithm is invoked when a process has to wait for some resource, for example, process 
0 blocking on process 1. 

o A special probe message is generated and sent to the process (or processes) holding the 
needed resources. 

o The message consists of three numbers: the process that just blocked, the process 
sending the message, and the process to whom it is being sent. 

o The initial message from 0 to 1 contains the triple (0, 0, 1). 

o When the message arrives, the recipient checks to see if it itself is waiting for any 
processes. 

o If so, the message is updated, keeping the first field but replacing the second field by its 
own process number and the third one by the number of the process it is waiting for. 
The message is then sent to the process on which it is blocked. 

o If it is blocked on multiple processes, all of them are sent (different) messages. 

o In Figure 3.19, the remote messages are labelled (0, 2, 3), (0, 4, 6), (0, 5, 7), and (0, 8, 0). 

o If a message goes all the way around and comes back to the original sender, that is, the 
process listed in the first field, a cycle exists and the system is deadlocked. 
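
Probe forwarding can be simulated on a single machine as shown below. The wait-for graph used here is an assumed reconstruction that is consistent with the probes quoted above; in particular the arcs from 1 to 2 and from 6 to 8 are not stated in the text. Dropping probes at processes that have already forwarded one is a convenience of this simulation so that it always terminates.

```python
def detect_deadlock(waits_for, initiator):
    """Return True if a probe started by initiator comes back to it.

    waits_for maps each process to the list of processes it is blocked on.
    A probe is a triple (blocked process, sender, receiver).
    """
    probes = [(initiator, initiator, q) for q in waits_for.get(initiator, [])]
    forwarded = set()
    while probes:
        blocked, sender, receiver = probes.pop()
        if receiver == blocked:
            return True                      # the probe returned to its originator
        if receiver in forwarded:
            continue                         # this process already passed a probe on
        forwarded.add(receiver)
        for nxt in waits_for.get(receiver, []):
            probes.append((blocked, receiver, nxt))   # keep field 1, update 2 and 3
    return False


# Assumed graph: 0->1->2->3, 3->4 and 3->5, 4->6, 5->7, 6->8, 8->0.
graph = {0: [1], 1: [2], 2: [3], 3: [4, 5], 4: [6], 5: [7], 6: [8], 8: [0]}
print(detect_deadlock(graph, 0))
```

Removing the arc from 8 back to 0 makes the same call return False, because no probe ever reaches the process named in its first field.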


5.4 Distributed Deadlock Recovery 
e Following are the techniques that can be used for deadlock recovery. 
(a) Recovery through pre-emption 


o The simplest way is to inform the operator that a deadlock has occurred and which processes 
are involved in it. 

o The next step would be to take a resource from some other process and handle the 
deadlock. 

o Whether this is possible depends on the nature of the resource. 

o One option is to abort all deadlocked processes and the other is to abort one process at 
a time until the deadlock cycle is eliminated. 


(b) Recovery through Rollback 


o To break a deadlock, it is sufficient to reclaim the resources from the processes which 
were selected for being killed. 

o In order to reclaim the resources, the process is rolled back to a point where the resource 
was not yet allocated to it. 

o The processes are checkpointed periodically, and process state is written to a file at 
regular intervals. 

o When a deadlock is detected, the process can be restarted from any of the checkpoints: it is 
rolled back only as far as is necessary to break the deadlock, and the necessary resources 
can then be reclaimed. 


(c) Recovery through killing process 


o It involves killing one of the processes in the deadlock cycle, reclaiming its resources and 
reallocating them. 
o Deadlock recovery algorithms analyse the resource requirements and interdependencies 
of the processes involved in the deadlock cycle and select the set of processes which, if killed, 
can break the cycle. 

o The processes are chosen such that they can be rerun from the beginning. 


5.5 Issues in Recovery from Deadlock 
(a) Selection of victims 

o Deadlock can be broken by killing or rolling back one of the processes, called victims. 

o A victim can be selected based on two factors: minimization of recovery costs and 
prevention of starvation. 

(b) Minimization of recovery cost 

o Each system should determine its own cost function to select victims. 

o A few factors which may be considered are: the priority of the processes, the nature of the 
processes, whether a rerun could have any ill effects, the number and types of 
resources held by the processes, the length of service already received, and the total number 
of processes which will be affected. 

(c) Prevention of starvation 

o If the system objective is only to minimize the recovery cost, it is possible that the same 
process will be repeatedly selected as a victim and may never go to completion. 

o This situation is called starvation and needs to be prevented. 

o One approach which may be adopted to avoid starvation is to raise the priority of the 
process every time it is selected as a victim. 

o Another is to include the number of times a process has been victimized as a parameter in 
the cost function that is used to select the victim. 
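As a rough sketch of the second approach, the cost function below adds a penalty for each time a process has already been chosen as a victim. The class `Candidate`, the function names, and all the weights are hypothetical; as the text notes, each system has to define its own cost function.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    pid: str
    priority: int            # higher means more important
    resources_held: int
    service_time: float      # CPU time already received
    times_victimized: int = 0


def recovery_cost(c, starvation_weight=10.0):
    # Hypothetical weights chosen only for illustration.
    return (c.priority * 5.0
            + c.resources_held * 1.0
            + c.service_time * 0.5
            + c.times_victimized * starvation_weight)


def select_victim(cycle):
    """Pick the cheapest process in the deadlock cycle and record the choice."""
    victim = min(cycle, key=recovery_cost)
    victim.times_victimized += 1
    return victim


cycle = [Candidate("P1", priority=1, resources_held=2, service_time=4.0),
         Candidate("P2", priority=1, resources_held=3, service_time=8.0)]
for _ in range(3):
    print(select_victim(cycle).pid)
```

Because the penalty grows with every selection, the same process is not chosen again and again even when it would otherwise always be the cheapest victim.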

(d) Use of transaction mechanism 

o After a process is killed or rolled back for recovery from a deadlock, it needs to be rerun. 

o This may not always be safe, because it may have performed some non-idempotent 
operations. 

o If a process has updated the amount in a bank account by a credit operation, re-execution 
will result in one more credit and an incorrect state. 
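One common way to make such a rerun safe, shown here only as a sketch, is to tag each operation with an identifier and apply it at most once. The `Account` class and the `txn_id` parameter are assumptions introduced for this example and are not part of the original notes.

```python
class Account:
    def __init__(self, balance=0):
        self.balance = balance
        self.applied = set()   # identifiers of credits already applied

    def credit(self, txn_id, amount):
        """Apply the credit once; a rerun with the same txn_id has no effect."""
        if txn_id in self.applied:
            return False
        self.balance += amount
        self.applied.add(txn_id)
        return True


acct = Account(100)
acct.credit("txn-1", 50)   # original execution
acct.credit("txn-1", 50)   # re-execution after recovery is ignored
print(acct.balance)
```

A full transaction mechanism would also undo partial updates on abort; the sketch only shows why remembering completed operations makes re-execution harmless.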
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1. Threads 


1.1 Comparison of Process and threads. 

e In traditional operating systems the basic unit of CPU utilization is a process. 

e Each process has its own program counter, its own register states, its own stack, and its own 
address space. 

e In operating systems with threads facility, the basic unit of CPU utilization is a thread. 

e Each thread of a process has its own program counter, its own register states, and its own 
stack. 

e All the threads of a process share the same address space and global variables. 

e All threads of a process also share the same set of operating system resources, such as open 
files, child processes, semaphores, signals, and accounting information. 


[Figure: (a) an address space containing a single thread and (b) an address space containing 
several threads.] 
Figure 4.1: (a) Single threaded and (b) Multithreaded 

e On a uniprocessor, threads run in quasi-parallel (time sharing), whereas on a shared-memory 
multiprocessor, as many threads can run simultaneously as there are processors. 
e At a particular instant of time, a thread can be in any one of several states: running, 
blocked, ready, or terminated. 

Criteria               | Process                               | Thread 
Control block          | Process Control Block (PCB):          | Thread Control Block (TCB): 
                       | program counter, stack, register      | program counter, stack, and 
                       | states, open files, child processes,  | register states. 
                       | semaphores, and timers.               | 
Address space          | Separate for different processes;     | Shares the process address space; 
                       | provides protection among processes.  | no protection needed between 
                       |                                       | threads belonging to the same 
                       |                                       | process. 
Creation overhead      | Large                                 | Small 
Context switching time | Large                                 | Small 
Objective of creation  | Resource utilization; processes are   | Use of the pipeline concept; 
                       | competitive.                          | threads are cooperative. 

Table 4.1: Comparison of process and thread. 
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2.3 Thread Models 
1. Dispatcher Worker Model 

o In this model, the process consists of a single dispatcher thread and multiple worker 
threads. 

o The dispatcher thread accepts requests from clients and, after examining the request, 
dispatches the request to one of the free worker threads for further processing of the 
request. 

o Each worker thread works on a different client request. 

o Therefore, multiple client requests can be processed in parallel, as shown in Figure 4.2(a). 

[Figure: requests arrive at a port of a server process for processing incoming requests; the 
process contains a dispatcher thread.] 
Figure 4.2(a): Dispatcher worker model 
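The dispatcher worker model can be sketched with Python's standard `threading` and `queue` modules. The function names `handle` and `run_server`, the number of workers, and the use of a shared queue to find a free worker are assumptions made for this example.

```python
import queue
import threading


def handle(request):
    # Placeholder for the real processing of one client request.
    return request.upper()


def run_server(requests, num_workers=3):
    work = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            req = work.get()
            if req is None:          # sentinel: no more requests
                break
            out = handle(req)
            with results_lock:
                results.append(out)

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in workers:
        t.start()

    # The calling thread plays the dispatcher: it accepts each request and
    # hands it over; whichever worker is free takes it from the queue.
    for req in requests:
        work.put(req)
    for _ in workers:
        work.put(None)
    for t in workers:
        t.join()
    return sorted(results)


print(run_server(["get a", "get b", "put c"]))
```

The results are sorted only to make the output independent of which worker finishes first.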
2. Team Model 
o In this model, all threads behave as equal. 
o Each thread gets and processes clients' requests on its own. 
o This model is often used for implementing specialized threads within a process. 
o Each thread of the process is specialized in servicing a specific type of request. 

[Figure: a server process for processing incoming requests that may be of three different 
types, each type of request being handled by a different thread.] 
Figure 4.2(b): Team model 
3. Pipeline model. 
o This model is useful for applications based on the producer-consumer model. 
o The output data generated by one part of the application is used as input for another part 
of the application. 
o In this model, the threads of a process are organized as a pipeline so that the output data 
generated by the first thread is used for processing by the second thread, the output of 
the second thread is used for processing by the third thread, and so on. 
o The output of the last thread in the pipeline is the final output of the process to which the 
threads belong. 

[Figure: a server process for processing incoming requests, each request processed in three 
steps, each step handled by a different thread, with the output of one step used as input to 
the next step.] 
Figure 4.2(c): Pipeline model 
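A minimal sketch of the pipeline model follows, with one thread per stage and a queue between neighbouring stages. The names `stage`, `run_pipeline`, and the `DONE` sentinel, as well as the three example stage functions, are illustrative assumptions.

```python
import queue
import threading

DONE = object()   # sentinel marking the end of the input stream


def stage(func, inbox, outbox):
    # Each stage consumes the previous stage's output and produces its own.
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            break
        outbox.put(func(item))


def run_pipeline(items, funcs):
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
               for i, f in enumerate(funcs)]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(DONE)

    results = []
    while True:
        out = queues[-1].get()   # output of the last thread is the final output
        if out is DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


print(run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 2, str]))
```

Since every stage is a single thread reading from a FIFO queue, items leave the pipeline in the order in which they entered it.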


2.3 Designing Issues in Thread Package 
1. Thread Creation 

o Threads can be created either statically or dynamically. 

o In the static approach, the number of threads of a process remains fixed for its entire 
lifetime, while in the dynamic approach, the number of threads of a process keeps 
changing dynamically. 

o In the dynamic approach, a process is started with a single thread, new threads are 
created as and when needed during the execution of the process, and a thread may 
destroy itself when it finishes its job by making an exit call. 

o On the other hand, in the static approach, the number of threads of a process is decided 
either at the time of writing the corresponding program or when the program is compiled. 


2. Thread Termination 

o A thread may either destroy itself when it finishes its job by making an exit call or be killed 
from outside by using the kill command and specifying the thread identifier as its 
parameter. 

o With statically created threads, the number of threads remains constant and the threads 
are never killed. 


3. Thread Synchronization 

o Suppose two threads of a process need to increment the same global variable within the 
process. 
o For this to occur safely, each thread must ensure that it has exclusive access to this 
variable for some period of time. 
o A segment of code in which a thread may be accessing some shared variable is called a 
critical region. 
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o Two commonly used mutual exclusion techniques in a threads package are mutex 
variables and condition variables. 
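The shared-counter case described above can be sketched with a mutex from Python's standard `threading` module. The names `counter`, `counter_lock`, `increment`, and `run` are chosen for this example only.

```python
import threading

counter = 0                        # global variable shared by all threads
counter_lock = threading.Lock()    # mutex guarding the critical region


def increment(times):
    global counter
    for _ in range(times):
        with counter_lock:         # enter the critical region
            counter += 1           # exclusive access to the shared variable


def run(num_threads=2, times=10000):
    global counter
    counter = 0
    threads = [threading.Thread(target=increment, args=(times,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


print(run())
```

Without the lock, the read-modify-write steps of the two threads could interleave and some increments could be lost.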
4. Thread Scheduling 
o Priority assignment facility 
= In a simple scheduling algorithm, threads are scheduled on a first-in, first-out basis, or 
the round-robin policy is used to timeshare the CPU cycles among the threads. 
= A priority-based thread scheduling scheme may be either non-preemptive or 
preemptive. 
= In the non-preemptive scheme, the CPU is not taken away from the thread to which it is 
assigned, even when a thread with higher priority becomes ready to run, until the running 
thread blocks, exits, or uses up its quantum. 
= In the preemptive scheme, a higher priority thread always preempts a lower priority one. 
o Flexibility to vary quantum size dynamically 
= The thread package normally supports dynamic variation of the quantum size. 
= In the fixed-length quantum technique, a fixed time quantum is assigned to each 
thread for using the CPU. 
= A worst-case scenario is a long-running thread. 
= This thread is interrupted repeatedly, and a context switch is carried out each time its 
time quantum expires. 
= It is then placed back in the queue of threads that are ready to run. 
= In such a case, the size of the time quantum can be varied in inverse proportion to the 
number of threads in the system. 
o Handoff scheduling 
= In this scheme, a thread can name its successor, so that the queue of runnable 
threads can be bypassed. 
o Affinity scheduling 
= In this scheme, a thread is scheduled on the CPU it last ran on, in the hope that part of 
its address space is still in that CPU's cache. 
5. Signal Handling 
o Signals provide software-generated interrupts and exceptions. 
o Interrupts are externally generated disruptions of a thread or process, whereas 
exceptions are caused by the occurrence of unusual condition during a thread's execution. 
o The two main issues associated with handling signals in a multithreaded environment are 
as follows: 
= A signal must be handled properly no matter which thread of the process receives it. 
= Signals must be prevented from getting lost. 


2.4 Implementing a Thread Package 
e A thread package can be implemented either in user space or in kernel space. 
1. User level approach 
o In the user level approach, shown in Figure 4.3, the user space holds the processes, their 
threads, and a runtime system, which is a collection of thread management routines. 

[Figure: user space contains the processes and their threads and the runtime system (maintains 
thread status information); kernel space contains the kernel (maintains process status 
information).] 
Figure 4.3: User level approach 

o This runtime system also maintains the status information table to keep track of the 
current status of each thread. 

o All calls of the thread package are implemented as calls to runtime system procedures. 

o These procedures also perform thread switching if the thread which made the call needs to 
be suspended. 

o The user level implementation uses two-level scheduling. 

o At level one, the kernel scheduler allocates quanta to processes; at level two, the runtime 
system divides the quantum of a process among its threads. 

o The existence of threads is invisible to the kernel because it manages only heavyweight 
processes. 

2. Kernel level approach. 


[Figure: user space contains the processes and their threads; kernel space contains the kernel 
(maintains thread status information).] 
Figure 4.4: Kernel level approach 
o In the kernel level approach, the threads are managed and maintained by the kernel. 
o The thread status information table is stored in the kernel address space. 
o Calls which may block a thread are system calls that trap to the kernel. 
o The major function of the kernel is to select the order in which the threads should run, 
based on which process they belong to. 


2. System Model 
2.1 Workstation Model 


o The workstation model consists of a network of personal computers, each with its own 
hard disk and local file system, interconnected over the network. 
o These are termed diskful workstations. 

[Figure: workstations interconnected by a network.] 
Figure 4.5: Workstation model 
o Figure 4.5 shows the schematic diagram of a workstation model. 
o These workstations are connected in a suitable network configuration using the star 
topology. 
o Each workstation can also work as a standalone single-user system. 
o In educational institutions or companies, such workstations typically tend to be 
idle after working hours, which results in a waste of computing power. 
o Users who have logged into their workstation may partition their job, load a partition onto 
an idle workstation, and utilize the extra processing power to get higher performance. 
o Each user has a home workstation on which he logs on and submits a job for execution. 
o The distributed operating system may find that the resources of this particular computer 
are not sufficient and may transfer the job to another idle workstation, execute the job there, 
and transfer the result back to the user's home system. 
o This can be done without the user being aware of such a transfer. 


2.2 Workstation server Model 


[Figure: workstations connected over a network to servers.] 
Figure 4.6: Workstation server Model 

o A workstation that does not have its own disk but keeps its files on a server is called a 
diskless workstation. 

o Such a workstation is less expensive. 

o The workstation server model consists of multiple workstations coupled with powerful 
servers with extra hardware to store the file systems and other software like databases, 
as shown in Figure 4.6. 

o Server processes can control the functioning of the system. 

o The workstations can be distributed among users. 

o A user can log into his home workstation and do normal computation on the local 
workstation, using the services of the high end servers as and when required. 

o Since the file system is common to all the workstations, a user can actually log into any one 
of them and carry on his work. 

o The local disk may contain temporary files, and the really important files can be stored on 
the main server, which is better maintained. 

o The workstation server model offers better services than a distributed system 
consisting of just workstations. 


2.3 Processor Pool Model 


[Figure: terminals connected over a network to a pool of processors.] 
Figure 4.7: Processor Pool Model 
o A processor pool model, as shown in Figure 4.7, consists of a pool of 
processors and a group of workstations. 
o Some of the processors in the pool have more computing power, while others may be 
used as file servers. 
o The processors are powerful microcomputers and minicomputers. 
o Each processor has its own memory and runs system programs or application programs. 
o The processors in the pool have no terminals connected to them directly. 
o The users access the system through their terminals via a high speed network. 
o The terminals can be diskless workstations or specialized workstations such as graphics 
workstations. 
o A specialized server called the run server handles the scheduling of the processors. 
o If a user logs into his terminal and wants to execute a parallel program on X 
computers, the run server allots X processors to the job. 
o They are returned to the pool after the parallel job completes. 
o The run server decides on which processor each job will run. 
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o If one of the processors in the pool fails, the run server allocates another processor in the 
pool. 
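The bookkeeping done by the run server can be sketched as follows. The class `RunServer` and its method names are hypothetical, and the policy of refusing a job when fewer than X processors are free is an assumption made to keep the example small.

```python
class RunServer:
    """Toy run server that hands out processors from a shared pool."""

    def __init__(self, processors):
        self.free = list(processors)
        self.allocated = {}               # job id -> processors in use

    def allocate(self, job_id, count):
        if count > len(self.free):
            return None                   # not enough free processors now
        procs = [self.free.pop() for _ in range(count)]
        self.allocated[job_id] = procs
        return procs

    def release(self, job_id):
        # Processors go back to the pool when the job completes.
        self.free.extend(self.allocated.pop(job_id, []))

    def replace_failed(self, job_id, failed):
        # Swap a failed processor for another one from the pool, if any.
        procs = self.allocated[job_id]
        procs.remove(failed)
        if self.free:
            procs.append(self.free.pop())
        return procs


server = RunServer(["p1", "p2", "p3", "p4"])
job = server.allocate("job-1", 3)
print(job, server.free)
server.release("job-1")
print(server.free)
```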


3. Processor Allocation/Task Assignment. 
e A process is considered to be composed of multiple tasks, and the goal is to find an optimal 
assignment policy for the tasks of an individual process. 

[Figure: task assignment algorithms, with branches for graph theoretic deterministic 
algorithms and the centralized heuristic algorithm.] 
Figure 4.8: Task Assignment Algorithms 


3.1 Graph Theoretic deterministic algorithms 

e A system with m CPUs and n processes falls into one of the following three cases: 
o m = n: Each process is allocated to one CPU. 
o m > n: Some CPUs may remain idle or work on earlier allocated processes. 
o m < n: There is a need to schedule processes on CPUs, and several processes may be 
assigned to each CPU. 

e The main objectives of CPU assignment are to minimize IPC cost, obtain quick 
turnaround time, achieve a high degree of parallelism for efficient utilization of system 
resources, and minimize network traffic. 

e Consider eight processes to be mapped to two processors, as shown in Figure 4.9. 


[Figure: graph whose nodes are the processes A to H and whose arcs carry IPC cost weights.] 
Figure 4.9: Task Assignment problem 
e The processes are represented as nodes A, B, C, D, E, F, G and H. 
e The arcs between sub-graphs represent network traffic, and their weights represent IPC costs. 
e The graph can be partitioned in two ways as follows. 
Partition 1 (Figure 4.10) 
e CPU 1 runs A, D, G 
e CPU 2 runs B, E, F, H and C 
[Figure: the graph divided between CPU 1 and CPU 2 according to partition 1.] 
Figure 4.10: Optimal task assignment with partition 1 
e Network traffic = 4+3+2+5 = 14 

Partition 2 (Figure 4.11) 
e CPU 1 runs processes A, D, E, G and B 
e CPU 2 runs processes H, C and F 
e Network traffic = 5+2+4 = 11 
[Figure: the graph divided between CPU 1 and CPU 2 according to partition 2.] 
Figure 4.11: Optimal task assignment with partition 2 
e Thus partition 2 incurs less network traffic than partition 1. 
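The comparison of partitions amounts to summing the weights of the arcs that cross between the CPUs. The sketch below shows that computation on a small hypothetical four-process graph; it is not the graph of Figure 4.9, whose arc weights are not reproduced here, and the function name `network_traffic` is an assumption.

```python
def network_traffic(edges, cpu_of):
    """Sum the weights of arcs whose endpoints are placed on different CPUs."""
    return sum(w for (a, b), w in edges.items() if cpu_of[a] != cpu_of[b])


# Hypothetical graph: arc (process, process) -> IPC cost.
edges = {("A", "B"): 3, ("A", "C"): 2, ("B", "D"): 4, ("C", "D"): 1}

partition_1 = {"A": 1, "B": 1, "C": 2, "D": 2}   # cuts A-C and B-D
partition_2 = {"A": 1, "C": 1, "B": 2, "D": 2}   # cuts A-B and C-D

print(network_traffic(edges, partition_1))
print(network_traffic(edges, partition_2))
```

The partition with the smaller sum is preferred, exactly as partition 2 is preferred over partition 1 in the text.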


3.2 Centralized heuristic algorithm/Top down algorithm 
[Figure: usage table entry plotted against time, with marks where process 1 arrives, is 
allocated a processor, and completes.] 
Figure 4.12: Centralized heuristic task assignment algorithm 


e As shown in Figure 4.12, the coordinator maintains the usage table, with one entry for every 
user; each entry is initially zero. 

e When significant events occur, messages are sent to the coordinator and the table is updated. 
e If a machine becomes overloaded, it decides to run a process on another machine. 
e The machine asks the usage table coordinator to allocate a new processor to it. 

e The request is granted if a processor is available; otherwise it is denied temporarily. 

e A zero value in the usage table indicates a neutral state, a positive value implies that the 
machine is a user of system resources, and a negative value means that the machine needs 
resources. 


3.3 Hierarchical Algorithm 


[Figure: tree of processors in which each processor has a superior and subordinates.] 
Figure 4.13: Hierarchical Task Assignment 
e As shown in Figure 4.13, the process hierarchy is modelled like an organization hierarchy. 
e Each processor maintains communication with one superior and a few subordinates. 
e If a middle level machine fails, the next subordinate takes over. 
e The top of the tree is truncated and replaced with a committee. 
e In case of failure of any one manager, the other managers in the committee continue the 
operation. 
e The allocation algorithm works between two levels in the group. 


4. Scheduling in Distributed System 
4.1 Desirable Features of a Good Global Scheduling Algorithm 
1. No A Priori knowledge about the Processes 
o A good process scheduling algorithm should operate with absolutely no a priori 
knowledge about the processes to be executed. 
o Obtaining prior knowledge poses an extra burden upon the users who must specify this 
information while submitting their processes for execution. 
2. Dynamic in Nature 
o Process assignment decisions should be based on the current load of the system and not 
on some fixed static policy. 
o Thus the system should support a preemptive process migration facility, in which a process 
can be migrated from one node to another during the course of its execution. 
3. Quick Decision making capability 
o A good process scheduling algorithm must make quick decisions about the assignment of 
processes to processors. 
o For example, an algorithm that models the system by a mathematical program and solves 
it on line is unsuitable because it does not meet this requirement. 
o Heuristic methods requiring less computational effort while providing near optimal 
results are therefore normally preferable to exhaustive solution methods. 
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4. Balance System Performance and Scheduling Overhead 


o Several global scheduling algorithms collect global state information and use this 
information in making process assignment decisions. 

o A common intuition is that greater amounts of information describing global system state 
allow more intelligent process assignment decisions to be made that have a positive effect 
on the system as a whole. 

o In a distributed environment, however, information regarding the state of the system is 
typically gathered at a higher cost than in a centralized system. 

o Hence algorithms that provide near optimal system performance with a minimum of 
global state information gathering overhead are desirable. 

5. Stability 

o A scheduling algorithm is said to be unstable if it can enter a state in which all the nodes 
of the system spend all of their time migrating processes, in an attempt to schedule them 
for better performance, without accomplishing any useful work. 

o This form of fruitless migration of processes is known as processor thrashing. 

o Processor thrashing can occur in situations where each node of the system has the power 
of scheduling its own processes and scheduling decisions are based on relatively old data 
owing to transmission delay between nodes. 

o For example, it may happen that nodes n1 and n2 both observe that node n3 is idle and 
then both offload a portion of their work to node n3 without being aware of the offloading 
decision made by the other. 

o Now if node n3 becomes overloaded due to the processes received from both nodes n1 
and n2, then it may again start transferring its processes to other nodes. 

o This entire cycle may be repeated again and again, resulting in an unstable state. 

6. Scalable 

o An algorithm should be scalable and able to handle workload inquiries from any number of 
machines in the network. 

o An algorithm in which each of the N nodes inquires about the workload of all the other 
nodes has an N^2 nature; it creates more network traffic and quickly consumes 
network bandwidth. 

o A simple approach to make an algorithm scalable is to probe only m of N nodes for 
selecting a host. 

o The value of m can be dynamically adjusted depending upon the value of N. 


7. Fault Tolerance 

o A good scheduling algorithm should not be disabled by the crash of one or more nodes of 
the system. 
o At any instant of time, it should continue functioning for the nodes that are up at that time. 

8. Fairness of Service 

o In any load balancing scheme, heavily loaded nodes may obtain all the benefits while 
lightly loaded nodes suffer poorer response time. 
o A fair strategy improves the response time of heavily loaded nodes without unduly 
affecting the response time of lightly loaded nodes. 
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5. Load Balancing 
5.1 Load balancing approaches 


[Figure: taxonomy of load balancing algorithms; the labels that survive are deterministic, 
probabilistic, centralized, distributed, and non-cooperative.] 
Figure 4.14: Load balancing approaches 
Static vs. Dynamic: 
e Static: It uses only information about the average behaviour of the system and ignores the 
current state of the system. 
e Dynamic: It responds to the current system state, which changes dynamically. 
1. Static: Deterministic vs. heuristic: 
o Deterministic algorithms are suitable when the process behaviour is known in advance. 
o If all the details, like the list of processes, computing requirements, file requirements, and 
communication requirements, are known prior to execution, then it is possible to make a 
perfect assignment. 
o If the load is unpredictable or varies from minute to minute or hour to hour, a 
heuristic processor allocation is preferred. 
2. Dynamic: Centralized vs. Distributed: 
o In a centralized algorithm, the scheduling decision is made at one single node, called the 
centralized server node. 
o This approach can make process assignment decisions efficiently, because the centralized 
server knows both the load at each node and the number of processes needing service. 
o The other nodes periodically send status update messages to the central server node. 
o In dynamic distributed scheduling algorithms, the task of processor assignment is physically 
distributed among various nodes. 
o Dynamic distributed scheduling algorithms are more effective than a centralized server. 
o Centralized algorithms are liable to a single point of failure. 
3. Cooperative vs. Non-cooperative: 
o Non-cooperative: In this algorithm, individual entities act as autonomous entities and make 
scheduling decisions independently. 
o Cooperative: In this algorithm, the distributed entities cooperate with each other to make 
scheduling decisions. 
o A cooperative algorithm is more complex, and its stability is better than that of a 
non-cooperative algorithm. 
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5.2 Issues in Designing Load Balancing Algorithms 
5.2.1 Load Estimation policies 

o A node's workload can be estimated based on some measurable parameters as follows: 
= Total number of processes on the node at the time of load estimation. 
= Resource demands of these processes. 
= Instruction mixes of these processes. 
= Architecture and speed of the node's processor. 

o Several load-balancing algorithms use the total number of processes present on the node 
as a measure of the node's workload. 

o Another measure used for estimating a node's workload is the sum of the remaining 
service times of all the processes on that node. 

o Another way is to measure CPU utilization. 

o A machine with 75% CPU utilization is more heavily loaded than a machine with 40% CPU 
utilization. 


5.2.2 Process Transfer Policies 

o The strategy of load-balancing algorithms is based on the idea of transferring some 
processes from the heavily loaded nodes to the lightly loaded nodes for processing. 

o Most load-balancing algorithms use the threshold policy to decide whether a node is 
lightly or heavily loaded. 

o Thus a new process at a node is accepted locally for processing if the workload of the 
node is below its threshold value at that time. 

o Otherwise, an attempt is made to transfer the process to a lightly loaded node. 

o The threshold value of a node may be determined by any of the following methods: 

1. Static Policy: Each node has a predefined threshold value depending on its processing 
capability. 

2. Dynamic Policy: In this method, the threshold value of a node (n_i) is calculated as the 
product of the average workload of all the nodes and a predefined constant (c_i). 
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Figure 4.15 : The load regions of (a) Single threshold policy and (b) double threshold policy 
o In the single-threshold policy, a node accepts new processes (either local or remote) if its 
load is below the threshold value, and it attempts to transfer local processes and rejects 
remote execution requests if its load is above the threshold value. 
o The high-low policy uses two threshold values, called the high mark and the low mark, which 
divide the space of possible load states of a node into the following three regions: 
= Overloaded: above the high-mark and low-mark values 
= Normal: above the low-mark value and below the high-mark value 
= Underloaded: below both values 
o A node's load state switches dynamically from one region to another, as shown in Figure 
4.16. 
Figure 4.16: State transition diagram of the load of a node in case of double-threshold policy. 
o When the load of the node is in the overloaded region, new local processes are sent to be 
run remotely and requests to accept remote processes are rejected. 
o When the load of the node is in the normal region, new local processes run locally and 
requests to accept remote processes are rejected. 
o When the load of the node is in the underloaded region, new local processes run locally 
and requests to accept remote processes are accepted. 
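The threshold rules above can be written down as a short sketch. The function names and the returned dictionary keys are chosen for this example, and the treatment of a load exactly equal to a threshold is an assumption, since the text does not specify it.

```python
def dynamic_threshold(loads, c):
    """Threshold as the average workload of all nodes times a constant c."""
    return c * sum(loads) / len(loads)


def single_threshold(load, threshold):
    # Below the threshold the node accepts both local and remote processes.
    accept = load < threshold
    return {"run_local_here": accept, "accept_remote": accept}


def double_threshold(load, low_mark, high_mark):
    if load > high_mark:
        region = "overloaded"
    elif load < low_mark:
        region = "underloaded"
    else:
        region = "normal"
    return {
        "region": region,
        "run_local_here": region != "overloaded",    # overloaded nodes offload
        "accept_remote": region == "underloaded",    # only underloaded accept
    }


print(double_threshold(9, low_mark=3, high_mark=7))
print(double_threshold(5, low_mark=3, high_mark=7))
print(double_threshold(1, low_mark=3, high_mark=7))
```

The normal region is what distinguishes the two policies: a node in it keeps its own new processes but takes no remote ones, which dampens transfers back and forth around a single threshold.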


5.2.3 Location Policies 
1. Threshold 
o In this method, a destination node is selected at random and a check is made to determine 
whether the transfer of the process to that node would place it in a state that prohibits 
the node from accepting remote processes. 
o If not, the process is transferred to the selected node. 
o If the check indicates that the selected node is in a state that prohibits it from accepting 
remote processes, another node is selected at random. 
o This continues until either a suitable destination node is found or the number of probes 
exceeds a static probe limit Lp. 
2. Shortest 
o In this method, Lp distinct nodes are chosen at random, and each is polled in turn to 
determine its load. 
o The process is transferred to the node having the minimum load value, unless that node's 
load is such that it prohibits the node from accepting remote processes. 
o If none of the polled nodes can accept the process, it is executed at its originating node. 
o If a destination node is found and the process is transferred there, the destination node 
must execute the process regardless of its state at the time the process actually arrives. 
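The threshold and shortest policies can be compared in a small sketch. The function names, the `can_accept` and `load_of` callbacks, and the optional `rng` argument are assumptions for illustration; returning `None` stands for executing the process at its originating node, which the text states for the shortest policy and which is assumed here for the threshold policy as well.

```python
import random


def threshold_policy(nodes, can_accept, probe_limit, rng=random):
    """Probe random nodes until one can accept or the probe limit is reached."""
    for _ in range(probe_limit):
        node = rng.choice(nodes)
        if can_accept(node):
            return node
    return None


def shortest_policy(nodes, load_of, can_accept, probe_limit, rng=random):
    """Poll probe_limit distinct random nodes and pick the least loaded one."""
    polled = rng.sample(nodes, min(probe_limit, len(nodes)))
    best = min(polled, key=load_of)
    return best if can_accept(best) else None


loads = {"n1": 5, "n2": 1, "n3": 3}
nodes = list(loads)
accepts = lambda n: loads[n] < 4      # hypothetical acceptance threshold

print(shortest_policy(nodes, loads.get, accepts, probe_limit=3))
print(threshold_policy(nodes, accepts, probe_limit=5))
```

The threshold policy stops at the first acceptable node, while the shortest policy always spends Lp polls in exchange for a better-informed choice.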
3. Bidding 
o Each node in the network is responsible for two roles with respect to the bidding process: 
manager and contractor. 
o The manager represents a node having a process in need of a location to execute, and the 
contractor represents a node that is able to accept remote processes. 
o A single node takes on both these roles, and no nodes are strictly managers or contractors 
alone. 
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o To select a node for its process, the manager broadcasts a request-for-bids message to all 
other nodes in the system. 

o Upon receiving this message, the contractor nodes return bids to the manager node. 

o The bids contain the quoted prices, which vary based on the processing capability, 
memory size, resource availability, and so on of the contractor nodes. 

o Of the bids received from the contractor nodes, the manager node chooses the best bid. 

o Once the best bid is determined, the process is transferred from the manager node to the 
winning contractor node. 

o But it is possible that a contractor node may simultaneously win many bids from many 
other manager nodes and thus become overloaded. 

o To prevent this situation, when the best bid is selected, a message is sent to the owner of 
that bid. 

o At that point the bidder may choose to accept or reject that process. 

o A message is sent back to the concerned manager node informing it as to whether the 
process has been accepted or rejected. 

o If the bid is rejected, the bidding procedure is started all over again. 
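The bidding exchange can be sketched from the manager's side as follows. The `Contractor` class, the `run_bidding` function, the rule that the lowest price is the best bid, and the bound on the number of rounds are all assumptions made for this example; the text only says that the manager chooses the best bid and restarts the procedure after a rejection.

```python
class Contractor:
    def __init__(self, price, capacity):
        self.price = price          # quoted price for running a process
        self.capacity = capacity    # how many more processes it will take

    def bid(self, process):
        # A node that cannot take more work does not bid.
        return self.price if self.capacity > 0 else None

    def accept(self, process):
        # The winner may still reject if it filled up in the meantime.
        if self.capacity > 0:
            self.capacity -= 1
            return True
        return False


def run_bidding(process, contractors, max_rounds=3):
    """Manager side: collect bids, pick the best, ask the winner to accept."""
    for _ in range(max_rounds):
        bids = {name: c.bid(process) for name, c in contractors.items()}
        bids = {name: b for name, b in bids.items() if b is not None}
        if not bids:
            return None
        winner = min(bids, key=bids.get)   # assumption: lowest price wins
        if contractors[winner].accept(process):
            return winner
        # Rejected: start the bidding procedure all over again.
    return None


contractors = {"n1": Contractor(price=5, capacity=1),
               "n2": Contractor(price=3, capacity=1)}
print(run_bidding("p1", contractors))
print(run_bidding("p2", contractors))
print(run_bidding("p3", contractors))
```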

4. Pairing 

o The method of accomplishing load balancing by the pairing policy is to reduce the variance 
of loads only between pairs of nodes of the system. 

o In this method, two nodes that differ greatly in load are temporarily paired with each 
other, and the load-balancing operation is carried out between the nodes belonging to 
the same pair by migrating one or more processes from the more heavily loaded node to 
the other node. 


5.2.4 State Information Exchange 
1. Periodic Broadcast 
o In this method, each node broadcasts its state information after the elapse of every t units 
of time. 
o This method has two disadvantages: (a) it generates heavy traffic and (b) it is not scalable. 
2. Broadcast when state changes 
o A node broadcasts its state information only when the state of the node changes. 
o A node's state changes when a process arrives at that node or when a process departs 
from the node. 
o A process may arrive at a node either from the external world or from some other node 
in the system. 
o A process departs from a node when either its execution is over or it is transferred to 
some other node. 
3. On Demand Exchange 
o A StateInformationRequest message is sent by a node when its state switches from the 
normal load region to either the underloaded region or the overloaded region. 
o On receiving this message, other nodes send their current state to the requesting node. 
o This method works with the two-threshold transfer policy. 
o On receiving the StateInformationRequest message, only those nodes reply that can 
cooperate with the requesting node in the load-balancing process. 
o That is, if the requesting node is underloaded, only overloaded nodes can cooperate with 
it in the load-balancing process, and vice versa. 
4. Exchange by Polling 
o When a node needs the cooperation of some other node for load balancing, it can search 
for a suitable partner by randomly polling the other nodes one by one. 
o The polling process stops either when a suitable partner is found or when a predefined poll 
limit is reached. 
5. Priority Assignment Policy 
o Following are the priority assignment rules for scheduling both local and remote 
processes at a particular node. 
= Selfish: Local processes are given higher priority than remote processes. 
= Altruistic: Remote processes are given higher priority than local processes. 
= Intermediate: The priority of processes depends on the number of local processes and 
the number of remote processes at the concerned node. 
o The selfish priority assignment rule gives the worst response time performance, while the 
altruistic assignment rule achieves the best response time performance. 
o The performance of the intermediate priority assignment rule falls between the other two 
policies. 
6. Migration Limiting Policies 
o Uncontrolled 
= In this case, a remote process arriving at a node is treated just as a process originating 
at the node. 
= Therefore, under this policy, a process may be migrated any number of times. 
o Controlled 
= Most systems treat remote processes differently from local processes and use a 
migration count parameter to fix a limit on the number of times that a process may 
migrate. 


6. Load Sharing Approaches 
6.1 Issues in Designing Load Sharing Algorithms 
6.1.1 Load Estimation Policies 
o The simple load estimation policy of counting the total number of processes on a node is not 
suitable for use in modern distributed systems. 
o Therefore, measuring CPU utilization is used as the method of load estimation in these 
systems. 
6.1.2 Process Transfer Policies 
o The load-sharing approach uses an all-or-nothing strategy. 
o This strategy uses the single-threshold policy with the threshold value of all the nodes 
fixed at 1. 




o That is, a node becomes a candidate for accepting a remote process only when it has no 
process, and a node becomes a candidate for transferring a process as soon as it has more 
than one process. 

o To improve this strategy, processes are also transferred in anticipation to nodes that are not 
idle but are expected to become idle soon. 

o Some load sharing algorithms use a threshold value 2 instead of 1. 
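
A minimal sketch of the all-or-nothing transfer policy, assuming load is measured as a plain process count. The function name `classify_node` and the role labels are hypothetical, and treating a threshold of 2 as "fewer than two processes may still accept work" is one possible reading of the text above.

```python
def classify_node(process_count, threshold=1):
    """Single-threshold (all-or-nothing) classification of a node."""
    if process_count < threshold:
        return "receiver"   # below the threshold: may accept a remote process
    if process_count > threshold:
        return "sender"     # above the threshold: may transfer a process away
    return "neutral"        # exactly at the threshold: neither sends nor receives
```

With the default threshold of 1, a node with no process is a receiver and a node with two or more processes is a sender, which matches the rule stated above.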

6.1.3 Location Policies 
o In load-sharing algorithms, the location policy decides the sender node or the receiver 
node of a process that is to be moved within the system for load sharing. 
o Following are the location policies: 
1. Sender-initiated policy: 
= The sender node of the process decides where to send the process. 
= In the sender-initiated location policy, heavily loaded nodes search for lightly loaded 
nodes to which work may be transferred. 
= A node is a viable candidate for receiving a process from the sender node only if the 
transfer of the process to that node would not increase the receiver node's load above 
its threshold value. 
2. Receiver-initiated policy: 
= The receiver node of the process decides from where to get the process. 
= In the receiver-initiated location policy, lightly loaded nodes search for heavily loaded 
nodes from which work may be transferred. 
= A node is a viable candidate for sending one of its processes only if the transfer of the 
process from that node would not reduce its load below the threshold value. 
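
The two viability conditions can be written as simple predicates. This sketch assumes that load is a process count and that one transfer changes the load of each node by exactly one; the function names are hypothetical.

```python
def can_receive(receiver_load, threshold):
    """Sender-initiated check: accepting one more process must not
    push the receiver's load above its threshold."""
    return receiver_load + 1 <= threshold

def can_send(sender_load, threshold):
    """Receiver-initiated check: giving away one process must not
    drop the sender's load below the threshold."""
    return sender_load - 1 >= threshold
```

For a threshold of 1, only an idle node passes `can_receive`, and only a node with at least two processes passes `can_send`.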
6.1.4 State Information Exchange policies 
o A node needs to know the state of other nodes only when it is either underloaded or 
overloaded. 
o Therefore, in load-sharing algorithms, a node normally exchanges state information with 
other nodes only when its state changes. 
o The two commonly used policies for this purpose are described below. 
1. Broadcast When State Changes 
= In this method, a node broadcasts a StateInformationRequest message when it 
becomes either underloaded or overloaded. 
= In the sender-initiated policy, a node broadcasts this message only when it becomes 
overloaded, and in the receiver-initiated policy, this message is broadcast by a node 
when it becomes underloaded. 
2. Poll When State Changes 
= In this method, when a node's state changes, it does not exchange state information 
with all other nodes but randomly polls the other nodes one by one and exchanges 
state information with the polled nodes. 
= The state exchange process stops either when a suitable node for sharing load with 
the probing node has been found or when the number of probes has reached the probe 
limit. 
= In the sender-initiated policy, polling is done by a node when it becomes overloaded, and 
in the receiver-initiated policy, polling is done by a node when it becomes underloaded. 


7. Fault Tolerance 
7.1 Component Failures 
e A fault is a malfunction caused by errors due to design, programming, manufacturing, physical 
damage, ageing, etc. 
e Transient faults: These faults occur suddenly, disappear and may not occur again if the 
operation is repeated. 
e Intermittent faults: These faults occur often, but may not be periodic in nature. 
e Permanent faults: These faults persist until the faulty component is repaired or replaced, so 
they are easily identified. 
e If p is the probability that a component fails in a given second, the expected time to failure 
can be calculated as follows: 
$$\text{Mean time to failure} = \sum_{k=1}^{\infty} k\,p\,(1-p)^{k-1}$$ 
e Using the standard equation for an infinite sum and differentiating both sides with respect to p, we 
get Mean time to failure = 1/p. 
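
The step from the infinite sum to 1/p can be spelled out as follows. This is a sketch under the assumption that p is the probability of failure in any given second and that seconds are independent, so that the first failure occurs in second k with probability p(1-p)^(k-1).

```latex
\begin{align*}
\sum_{k=1}^{\infty} \alpha^{k} &= \frac{\alpha}{1-\alpha}, \qquad 0 \le \alpha < 1 \\
\sum_{k=1}^{\infty} (1-p)^{k} &= \frac{1-p}{p} = \frac{1}{p} - 1
  && \text{(substitute } \alpha = 1-p\text{)} \\
-\sum_{k=1}^{\infty} k\,(1-p)^{k-1} &= -\frac{1}{p^{2}}
  && \text{(differentiate both sides with respect to } p\text{)} \\
\sum_{k=1}^{\infty} k\,p\,(1-p)^{k-1} &= \frac{1}{p}
  && \text{(multiply both sides by } -p\text{)}
\end{align*}
```

For example, if p = 10^-6 per second, the mean time to failure is 10^6 seconds, which is roughly 11.6 days.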


7.2 System Failures 

e Hardware or software faults can be classified as fail-silent faults and Byzantine faults. 

e Fail-silent faults: A faulty processor stops responding to any input and does not produce any 
output. 
o These are also called fail-stop faults. 

e Byzantine faults: When a Byzantine fault occurs, the system continues to run but produces 
wrong outputs. 
o The faulty processors may even work together maliciously to give the impression of correct 
operation. 


7.3 Use of Redundancy 
e The techniques used to achieve fault tolerance are: 
e Information redundancy: Extra bits are added to the transmitted data so that errors caused by 
garbled bits can be detected and corrected. 
o Some commonly used error detecting and correcting codes are parity, CRC and Hamming 
code. 
e Time redundancy: An action that has been performed once is repeated, if needed, after a 
specific time period. 
o For example, if an atomic transaction aborts, it can be re-executed without any side effects. 
e Physical redundancy: Extra equipment is added to enable the system to tolerate faults due 
to the loss or malfunction of some components. 
o For example, extra standby processors can be used in the system. 
o If one of the processors crashes, a standby processor takes over the load and the system 
continues to function normally. 
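
As a small illustration of information redundancy, the sketch below uses a single even-parity bit, the simplest of the codes mentioned above. The function names and the sample data are hypothetical. A single parity bit can only detect an odd number of flipped bits; it cannot correct them, which is what CRC-based retransmission or a Hamming code is needed for.

```python
def add_even_parity(bits):
    """Append one parity bit so that the total number of 1s is even."""
    return bits + [sum(bits) % 2]

def parity_ok(word):
    """Return True if the received word still has an even number of 1s."""
    return sum(word) % 2 == 0

word = add_even_parity([1, 0, 1, 1])   # three 1s, so the parity bit is 1
corrupted = word.copy()
corrupted[2] ^= 1                      # flip a single bit "in transit"
```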
8. Real Time Distributed System 

e Distributed real-time systems can often be structured as illustrated in Fig. 4.17. 

e Here we see a collection of computers connected by a network. 

e Some of these are connected to external devices that produce or accept data or expect to be 
controlled in real time. 

e The computers may be tiny microcontrollers built into the devices, or stand-alone machines. 

e In both cases they usually have sensors for receiving signals from the devices and/or actuators 
for sending signals to them. 


e The sensors and actuators may be digital or analog. 

Figure 4.17: Real time system (computers attached to external devices and connected by a network) 


e Real-time systems are generally split into two types depending on how serious their deadlines 
are and the consequences of missing one. 
1. Soft real time system: missing an occasional deadline is all right. 
o For example, a telephone switch might be permitted to lose or misroute one call in 
10^5 under overload conditions and still be within specification. 
2. Hard real time system: even a single missed deadline is unacceptable, as this might 
lead to loss of life or an environmental catastrophe. 


9. Process Migration and related issues 
Process migration is the relocation of a process from its current location (the source node) to 
another node (the destination node). 
Figure 4.18: Flow of execution of a migrating process 
e A process may be migrated either before it starts executing on its source node (non-preemptive 
migration) or during the course of its execution (preemptive migration). 
e Process migration involves the following major steps: 
o Selection of a process that should be migrated. 
o Selection of the destination node to which the selected process should be migrated. 
o Actual transfer of the selected process to the destination node. 


9.1. Desirable Features of Good Process Migration Mechanism 
1. Transparency 
o Transparency is an important requirement for a system that supports process migration. 
o Object access level 
= Object access level transparency is the minimum requirement for a system to support 
non-preemptive process migration. 
= Access to objects such as files and devices can be done in a location-independent 
manner. 
= Object access level transparency allows free initiation of programs at an arbitrary 
node. 
o System call and interprocess communication level 
= System call and interprocess communication level transparency is a must for a system 
to support a preemptive process migration facility. 
= Transparency of interprocess communication is also desired for the transparent 
redirection of messages during the transient state of a process that has recently migrated. 
2. Minimal Interference 
o Migration of a process should cause minimal interference to the progress of the 
process involved and to the system as a whole. 
o This can be achieved by minimizing the freezing time of the process being migrated. 
3. Minimal Residual Dependencies 
o A migrated process should not in any way continue to depend on its previous node 
once it has started executing on its new node since, otherwise, the following will 
occur: 
= The migrated process continues to impose a load on its previous node, which reduces 
the benefit of migrating it. 
= A failure or reboot of the previous node will cause the process to fail. 
4. Efficiency 
o It is the major issue in implementing process migration. 
o The main sources of inefficiency are the time required to migrate a process, the cost of 
locating an object and the cost of supporting remote execution once the process is 
migrated. 
5. Robustness 
o The failure of a node other than the one on which a process is currently running should 
not in any way affect the accessibility or execution of that process. 
6. Communication between coprocesses of a job 
o One benefit of process migration is parallel processing among the processes of a single 
job distributed over several nodes. 
o Coprocesses should be able to communicate directly with each other irrespective of their 
locations. 

9.2 Process Migration Mechanism 
1. Mechanisms for Freezing and Restarting a Process 
o In preemptive process migration, at some point during migration, the process is frozen on 
its source node and its state information is transferred to its destination node. 
o The process is restarted on its destination node using this state information. 
o Immediate and Delayed Blocking of the Process 
= Depending upon the process's current state, it may be blocked immediately or the 
blocking may have to be delayed until the process reaches a state when it can be 
blocked. 

(a) If the process is not executing a system call, it can be immediately blocked from 
further execution. 

(b) If the process is executing a system call but is sleeping at an interruptible priority 
waiting for a kernel event to occur, it can be immediately blocked from further 
execution. 

(c) If the process is executing a system call and is sleeping at a non-interruptible priority 
waiting for a kernel event to occur, it cannot be blocked immediately. 
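
Cases (a) to (c) amount to a small decision rule, sketched below. The function and parameter names are hypothetical, and treating every situation other than (a) and (b) as "delay the blocking" is an assumption made for this illustration.

```python
def can_block_immediately(in_system_call, sleeping_interruptible=False):
    """Return True if a process may be frozen right away for migration."""
    if not in_system_call:
        return True                # case (a): not inside a system call
    # Inside a system call: only an interruptible sleep allows immediate
    # blocking (case b); otherwise blocking must be delayed (case c).
    return sleeping_interruptible
```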

o Fast and Slow I/O Operations 
= The process is frozen after the completion of all fast I/O operations. 
= It is feasible to wait for fast I/O operations to complete before freezing the process. 
= It is not feasible to wait for slow I/O operations, such as those on a pipe or terminal, 
because the process must be frozen in a timely manner for process migration to be 
effective. 
o Information about Open Files 
= A process's state information also includes the information pertaining to the files 
currently open by the process. 
= This includes information such as the names or identifiers of the files, their access 
modes, and the current positions of their file pointers. 
o Reinitiating the Process on its Destination Node 
= On the destination node, an empty process state is created that is similar to that 
allocated during process creation. 
= Depending upon the implementation, the newly allocated process may or may not 
have the same process identifier as the migrating process. 
= In some implementations, the newly created copy of the process initially has a process 
identifier different from that of the migrating process in order to allow both the old copy 
and the new copy to exist and be accessible at the same time. 
= If the process identifier of the new copy of the process is different from that of its old copy, 
the new copy's identifier is changed to the original identifier in a subsequent step 
before the process starts executing on the destination node. 
= Once all the state of the migrating process has been transferred from the source node 
to the destination node and copied into the empty process state, the new copy of the 
process is unfrozen and the old copy is deleted. 
= The process is restarted on its destination node in whatever state it was in before 
being migrated. 
2. Address Space Transfer Mechanism 
o Total Freezing 
= In this method, a process's execution is stopped while its address space is being 
transferred, as shown in Figure 4.19. 
Figure 4.19: Total Freezing mechanism 
= Its disadvantage is that if a process is suspended for a long time during migration, 
timeouts may occur, and if the process is interactive, the delay will be noticed by the 
user. 
o Pretransferring 
= In this method, the address space is transferred while the process is still running on 
the source node, as shown in Figure 4.20. 
Figure 4.20: Pretransfer mechanism 

= After the decision to migrate is made, the process continues to execute on the source node 
until its address space has been transferred. 

= Initially the entire address space is transferred, followed by repeated transfers of the pages 
modified during the previous transfer, and so on until no further reduction in the number of 
modified pages is achieved. 

= The remaining pages are transferred after the process is frozen for transferring its 
state information. 

= Freezing time is reduced, but the total migration time may increase because the same 
pages may be transferred more than once as they become dirty while the pretransfer is being done. 
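
The pretransfer loop can be illustrated with a toy simulation. Everything here is an assumption made for the sketch: pages are only counted, not copied, and the caller supplies a hypothetical `dirtied_per_round` function that says how many pages the still-running process modifies while a given number of pages is being sent.

```python
def pretransfer(total_pages, dirtied_per_round, max_rounds=10):
    """Simulate pre-copying of an address space.

    Returns (pages sent while the process was running,
             pages left to send while the process is frozen)."""
    to_send = total_pages              # first round sends the whole address space
    sent_while_running = 0
    for round_no in range(max_rounds):
        sent_while_running += to_send
        dirty = dirtied_per_round(round_no, to_send)  # modified during this round
        no_reduction = dirty >= to_send
        to_send = dirty
        if no_reduction or dirty == 0:
            break                      # further rounds would not shrink the dirty set
    return sent_while_running, to_send

# Hypothetical workload: a quarter of the pages just sent get dirtied again,
# but never fewer than 20 pages (the process's hot working set).
result = pretransfer(1000, lambda rnd, sent: max(20, sent // 4))
```

In this example 1332 pages are sent while the process keeps running and only 20 are sent during the freeze, compared with 1000 pages sent during the freeze under total freezing. The price is 352 redundant page transfers.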
o Transfer on Reference 
Figure 4.21: Transfer on reference mechanism. 

= In this method, the process's address space is left behind on its source node. 

= As the relocated process executes on its destination node, its attempts to reference 
memory pages result in the generation of requests to copy in the desired blocks from 
their remote locations. 

= In this demand-driven copy-on-reference approach, a page of the migrant process's 
address space is transferred from its source node to its destination node only when 
referenced. 

= Failure of the source node results in failure of the process. 


3. Message Forwarding Mechanism 
o The messages to be forwarded to the migrant process's new location can be classified into 
the following types: 
= Type 1: Messages received at the source node after the process's execution has been 
stopped on its source node and the process's execution has not yet been started on its 
destination node. 
= Type 2: Messages received at the source node after the process's execution has started 
on its destination node. 
= Type 3: Messages that are to be sent to the migrant process from any other node after it 
has started executing on the destination node. 
o Mechanism of Resending the Message 
= Messages of types 1 and 2 are returned to the sender as not deliverable or are simply 
dropped. 
= The sender of the message stores a copy of the data and is prepared to retransmit 
it. 
= The sender retries after locating the new node (using a locate operation). 
= Type 3 messages are sent directly to the new node. 
o Origin Site Mechanism 
= The process identifier in these systems has the process's origin site embedded in it, 
and each site is responsible for keeping information about the current locations of all 
the processes created on it. 
= In these systems, messages for a particular process are always first sent to its origin 
site. 
= The origin site then forwards the message to the process's current location. 
= There is a continuous load on the migrant process's origin site even after the process 
has migrated from that node. 
o Link Traversal Mechanism 
= To redirect the messages of type 1, a message queue for the migrant process is 
created on its source node. 
= All messages in the queue are sent to the destination node as a part of the migration 
procedure. 
= To redirect the messages of types 2 and 3, a forwarding address known as a link is left at 
the source node pointing to the destination node of the migrant process. 
= The most important part of a link is the message process address, which has two 
components. 
= The first component is a system-wide unique process identifier. 
= The second component is the last known location of the process. 
= To forward messages of types 2 and 3, a migrated process is located by traversing a 
series of links that form a chain ultimately leading to the process's current location. 
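
Link traversal can be sketched as following a chain of forwarding addresses. The `links` dictionary, the node names and the function `locate` are hypothetical; the sketch also assumes that a node removes its own link if the process later migrates back to it, so the chain contains no cycle.

```python
def locate(process_id, start_node, links):
    """Follow forwarding links from start_node.

    links maps (node, process_id) -> next known location.
    Returns (current node of the process, number of links traversed)."""
    node, hops = start_node, 0
    while (node, process_id) in links:
        node = links[(node, process_id)]   # link left behind by an earlier migration
        hops += 1
    return node, hops

# Process "p7" was created on node A, migrated to B and then to C.
links = {("A", "p7"): "B", ("B", "p7"): "C"}
location = locate("p7", "A", links)
```

Each migration adds one hop to the chain, and the process becomes unreachable if any node holding a link in the chain fails.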
o Link Update Mechanism 
= During the transfer phase, the source node sends link-update messages to the 
kernels controlling all of the migrant process's communication partners. 
= These link-update messages tell the new address of each link held by the migrant 
process and are acknowledged for synchronization purposes. 
= Messages of types 1 and 2 are forwarded to the destination node by the source node, 
and messages of type 3 are sent directly to the process's destination node. 
4. Mechanism for Handling Coprocesses 
o Disallowing Separation of Coprocesses 
= A process that is waiting for its children to complete execution is not migrated. 
= When a parent process is migrated, its child processes are migrated together with it. 
o Home Node or Origin Site Concept 
= It is used in the Sprite system and allows complete freedom in migrating a 
process. 
= It implies that each process has a home node, but the process and its coprocesses can 
execute independently on different nodes of the system. 
= It increases message traffic and communication cost. 


9.3. Process Migration in Heterogeneous system. 
o Process migration is simple and easy in homogeneous systems, as data representation is 
consistent and compatible on the source and destination nodes. 
o But in a heterogeneous system with n types of CPUs, n(n-1) pieces of translation 
software are needed to support migration, as shown in Figure 4.22. 
o Thus the system becomes complex and undesirable. 
o In order to decrease this complexity, an external data representation (XDR) is used, as shown 
in Figure 4.23. 
o In this mechanism, a standard representation is used for the transport of data, and each 
processor needs only to be able to convert data to and from the standard form, so only 2n 
pieces of translation software are needed. 

Figure 4.22: Example illustrating the need for 12 pieces of translation software in a 
heterogeneous system having 4 types of processors. 
Figure 4.23: Example illustrating the need for only 8 pieces of translation software in a heterogeneous 
system 
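
The counts in the two figure captions follow from simple arithmetic, sketched here with hypothetical function names: direct translation needs one translator per ordered pair of processor types, while a common external representation needs one encoder and one decoder per type.

```python
def pairwise_translators(n):
    """Translators needed when every type converts directly to every other type."""
    return n * (n - 1)

def xdr_translators(n):
    """Translators needed with a common external data representation:
    one 'to standard form' and one 'from standard form' per type."""
    return 2 * n
```

For 4 processor types this gives 12 and 8 respectively. The saving grows quickly: for 10 types it is 90 against 20.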
o The process of converting from a particular machine representation to the external data 
representation format is called serializing, and the reverse process is called deserializing. 
o Care has to be taken in handling floating point numbers because they contain exponent, 
mantissa and sign bits. 
o Handling Exponent 
= The number of bits used for the exponent of a floating point number varies from 
processor to processor. 
= Suppose that, for the exponent, processor A uses 8 bits, processor B uses 16 bits, and the 
external data representation designed by the users of processor architecture A 
provides 12 bits. 
= In this situation, a process can be migrated from processor A to B without any problem 
in representing floating point numbers. 
= This is because the two-step translation of the exponent involves the conversion of 
8 bits of data to 12 bits and then of 12 bits of data to 16 bits. 
= But a process that has some floating point data whose exponent requires more than 
12 bits cannot be migrated from processor B to A, 
= as this floating point data cannot be represented in the external data representation, 
which has only 12 bits for the exponent. 
= To avoid this, the XDR should always have at least as many bits in the exponent as the longest 
exponent of any processor in the distributed system. 
o Handling Mantissa 

= Suppose that, for the mantissa, processor A uses 32 bits, processor B uses 64 bits, and the 
external data representation uses 48 bits. 

= Migration of a process from processor A to B will have no problem. 

= But the migration of a process from processor B to A will result in the computation 
being carried out in "half-precision." 

= To overcome this, the XDR must have sufficient precision to handle the largest 
mantissa. 

= The direction of migration should be restricted to nodes having a mantissa at 
least as large as that of the source node. 

= The worst case is the cumulative loss of precision over multiple migrations, which 
may invalidate a computation. 
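
The loss of precision can be made concrete with a toy model. This is an illustration only: a mantissa is modelled as a left-aligned unsigned integer, narrowing simply truncates the low-order bits (real formats may round instead), and the widths 32, 48 and 64 are the ones from the example above. The function names are hypothetical.

```python
A_BITS, XDR_BITS, B_BITS = 32, 48, 64   # mantissa widths from the example above

def convert_mantissa(mantissa, from_bits, to_bits):
    """Re-encode a left-aligned mantissa in a different width."""
    if to_bits >= from_bits:
        return mantissa << (to_bits - from_bits)   # widening pads with zeros: no loss
    return mantissa >> (from_bits - to_bits)       # narrowing drops low-order bits

def migrate(mantissa, src_bits, dst_bits):
    """Two-step translation: source format -> XDR -> destination format."""
    in_xdr = convert_mantissa(mantissa, src_bits, XDR_BITS)
    return convert_mantissa(in_xdr, XDR_BITS, dst_bits)

a_value = 0xDEADBEEF          # fits in processor A's 32 bits
b_value = (1 << 64) - 1       # uses all 64 bits of processor B
```

A value that originates on A survives the round trip A to B to A unchanged, whereas a value that uses all of B's 64 bits comes back from A with its lower 32 bits lost.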

o Handling Signed Infinity and Signed Zero 

= Signed infinity is supported by some architectures; it indicates that a computation 
result is too large (overflow) or too small (underflow). 

= Some architectures use signed zero to represent numbers that are too large or too small. 

= The XDR should take care of these values, so that CPUs can take advantage of this information 
or discard it. 


9.4 Advantages of Process Migration 
1. Reducing average response time of processes 
o A process migration facility can be used to reduce the average response time of the processes 
of a heavily loaded node by running some of them on idle or underutilized nodes. 
2. Speeding up individual jobs 
o A process migration facility may be used to speed up individual jobs in two ways. 
o The first method is to migrate the tasks of a job to different nodes of the system 
and to execute them concurrently. 
o The second approach is to migrate a job to a node having a faster CPU or to a node at 
which it has the minimum turnaround time. 
3. Gaining higher throughput 
o In a system with a process migration facility, the capabilities of the CPUs of all the nodes 
can be better utilized by using a suitable load-balancing policy. 
o This helps in improving the throughput of the system. 
4. Utilizing resources effectively 
o Process migration also helps in utilizing resources effectively, since a 
distributed system consists of hardware resources (such as CPUs, printers and storage) 
and software resources (such as databases and files) with different capabilities. 
o Depending on the nature of a process, it can be migrated to a suitable node to utilize the 
system resources efficiently. 
5. Reducing network traffic 
o Migrating a process closer to the resource it is using most heavily may reduce network 
traffic in the system if the decreased cost of accessing its favourite resources offsets 
the possible increased cost of accessing its less favoured ones. 
o Another way to reduce network traffic by process migration is to migrate and cluster 
two or more processes that frequently communicate with each other on the same 
node of the system. 
6. Improving system reliability 
o A critical process can simply be migrated to a node whose reliability is higher than that of 
other nodes in the system. 
o Alternatively, a copy of a critical process can be migrated to some other node, and both the 
original and the copied processes executed concurrently on different nodes. 
o In failure modes such as manual shutdown, which manifest themselves as gradual 
degradation of a node, the processes of the node may be migrated to another node for 
their continued execution before the dying node completely fails. 
7. Improving system security 
o A sensitive process may be migrated and run on a secure node that is not directly 
accessible to general users, thus improving the security of that process. 

1. Introduction 
e A file system is a subsystem of an operating system that performs file management activities such as 
organization, storing, retrieval, naming, sharing and protection of files. 
e A distributed file system supports: 
1. Remote information sharing 
o A distributed file system allows a file to be transparently accessed by processes of any 
node of the system irrespective of the file's location. 
2. User mobility 
o A user should not be forced to work on one specific node but should have the flexibility to work 
on different nodes at different times. 
3. Availability 
o For better fault tolerance, files should be available for use even in the event of temporary 
failure of one or more nodes of the system. 
4. Diskless workstations 
o A diskless workstation is more economical, less noisy and generates less heat. 
o A distributed file system, with its transparent remote file sharing capability, allows the use of 
diskless workstations in a system. 

e A distributed file system provides the following three types of services: 
1. Storage service 
o Deals with the allocation and management of space on secondary storage devices. 
o It provides a logical view of the storage system by providing operations for storing and 
retrieving data. 
o Also known as disk service and block service. 
2. True file service 
o It is concerned with the operations on individual files, such as operations for accessing and 
modifying the data in files and for creating and deleting files. 
3. Name service 
o It provides a mapping between text names for files and their IDs. 
o It is also known as directory service. 
o It performs directory related activities such as creation and deletion of directories, adding a 
new file to a directory, deleting a file from a directory, changing the name of a file, and moving 
a file from one directory to another. 


2. Features and goals of distributed file system 
1. Transparency 
o Structure Transparency 
= Where there are multiple file servers, the multiplicity of file servers should be transparent to the 
clients of a distributed file system. 
= Clients should not know the number or locations of the file servers and the storage 
devices. 
o Naming Transparency 
= The name of a file should not give any hint regarding the file's location. 
o Access Transparency 
= A client process on a host has a uniform mechanism to access all files in the system, regardless 
of whether the files are on a local or a remote host. 
o Replication Transparency 
= Files may be replicated to provide redundancy for availability and also to permit 
concurrent access for efficiency. 
2. User Mobility 
o A user should not be forced to work on one specific node but should have the flexibility to work 
on different nodes at different times. 
3. Performance 
o Performance is measured as the average amount of time needed to satisfy client requests. 
o The extra overhead due to network communication for remote files should be hidden from 
users. 
4. Simplicity and ease of use 
o The user interface should be simple and the number of commands should be as small as 
possible. 
5. Scalability 
o A good distributed file system should be able to cope with the growth of nodes and users in the 
system. 
o Growth should not affect the overall performance of the system. 
6. High availability 
o Failure of one or more components of the system should not degrade the performance of the 
system. 
7. High reliability 
o The file system should automatically generate backup copies of critical files in order to 
recover from sudden failure. 
o Stable storage is a technique used by several file systems for high reliability. 
8. Data integrity 
o For a shared file, the file system must guarantee the integrity of the data stored in it. 
o Concurrent access requests from multiple users who are competing to access the file must 
be properly synchronized by the use of some form of concurrency control mechanism. 
9. Security 
o A distributed file system should be secure so that its users can be confident of the privacy 
of their data. 
10. Heterogeneity 
o A heterogeneous distributed system gives its users the flexibility to use different 
computer platforms for different applications. 


3. File models 
1. Unstructured and Structured file 
e Unstructured Files 
o In this model there is no substructure known to the file server, and the contents of each 
file of the file system appear to the file server as an uninterpreted sequence of bytes. 
o Unix and MS-DOS use this file model. 

e Structured Files 

o In the structured file model, a file appears to the file server as an ordered sequence of records. 

o In this model a record is the smallest unit of file data that can be accessed, and the file 
system read or write operations are carried out on a set of records. 

o Structured files are of two types: (1) files with non-indexed records and (2) files with indexed records. 

1) Non-indexed records: In this model a file record is accessed by specifying its position within the 
file, for example the fifth record from the beginning of the file or the second record from the 
end of the file. 

2) Indexed records: In this model a file is maintained as a B-tree or other suitable data 
structure, or a hash table is used, so that records can be located quickly. 

2. Mutable and Immutable files. 
e Mutable Files 

o In this model an update performed on a file overwrites its old contents to produce the new 
contents. 

o Most existing operating systems use the mutable file model. 

e Immutable Files 

o In this model a file cannot be modified once it has been created, except to be deleted. 

o A versioning approach is normally used to implement file updates, and each file is 
represented by a history of immutable versions. 

o Rather than updating the same file, a new file is created each time a change is made to the 
file contents, and the old version is retained unchanged. 

o The immutable file model makes it easy to support consistent sharing. 

o Disadvantage: increased use of disk space and increased disk allocation activity. 
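
A minimal sketch of the versioning approach, assuming versions are kept in memory and numbered from 0. The class `VersionedFile` and its methods are hypothetical and do not belong to any particular file system.

```python
class VersionedFile:
    """A file represented as a history of immutable versions."""

    def __init__(self, data=b""):
        self._versions = [bytes(data)]          # version 0

    def update(self, new_data):
        """Create a new version instead of overwriting; return its number."""
        self._versions.append(bytes(new_data))  # earlier versions stay unchanged
        return len(self._versions) - 1

    def read(self, version=-1):
        """Read a given version (the latest one by default)."""
        return self._versions[version]

f = VersionedFile(b"draft")
latest = f.update(b"final")
```

A reader that opened version 0 keeps seeing `b"draft"` even after the update, which is why consistent sharing is easy. The growing version list also shows where the extra disk space goes.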


4. File accessing models 
1. Accessing Remote Files 
e A distributed file system may use one of the following models to service a client's file access 
request. 
1. Remote service model 

o The client's request for file access is delivered to the server, the server machine performs 
the access request, and finally the result is forwarded back to the client. 

o The access requests from the client and the server replies for the client are transferred 
across the network as messages. 

o The file server interface and the communication protocols must be designed carefully to 
minimize the overhead of generating messages as well as the number of messages that 
must be exchanged in order to satisfy a particular request. 

2. Data caching model 

o If the data needed to satisfy the client's access request is not present locally, it is copied 
from the server's node to the client's node and is cached there. 

o The client's request is processed on the client's node itself by using the cached data. 

o Thus repeated accesses to the same data can be handled locally. 
o A replacement strategy such as LRU is used for cache management. 
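
The data caching model with LRU replacement can be sketched as follows. The class `ClientCache`, its `fetch_from_server` callback and the in-memory `server` dictionary are hypothetical stand-ins; a real client would fetch data over the network and would also have to deal with cache consistency, which this sketch ignores.

```python
from collections import OrderedDict

class ClientCache:
    """Client-side cache with least-recently-used (LRU) replacement."""

    def __init__(self, capacity, fetch_from_server):
        self.capacity = capacity
        self.fetch = fetch_from_server
        self.blocks = OrderedDict()        # least recently used entry comes first
        self.server_requests = 0

    def read(self, key):
        if key in self.blocks:             # hit: served locally
            self.blocks.move_to_end(key)   # mark as most recently used
            return self.blocks[key]
        self.server_requests += 1          # miss: copy the data from the server
        data = self.fetch(key)
        self.blocks[key] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)   # evict the least recently used entry
        return data

server = {"a": b"A", "b": b"B", "c": b"C"}
cache = ClientCache(capacity=2, fetch_from_server=server.__getitem__)
for name in ["a", "b", "a", "c", "a", "b"]:
    cache.read(name)
```

Of the six reads, only four go to the server; the two repeated reads of `a` are handled locally.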


2. Unit of Data Transfer 
e Unit of data refers to fraction of file data that is transferred to and from clients as a result of 
a single read or write operations. 
1. File level transfer model 
o Whole file is treated as unit of data transfer between client and server in both the 
direction. 
o Advantages 
= Transmitting an entire file in response to a single request is more efficient than
transmitting it page by page as the network protocol overhead is required only once.
= It has better scalability because it requires fewer accesses to file servers, resulting in
reduced server load and network traffic.
= Disk access routines on the servers can be better optimized if it is known that requests
are always for entire files rather than for random disk blocks.
= Once an entire file is cached at a client's site, it becomes immune to server and
network failures.
o Disadvantage
= This model requires sufficient storage space on the client's node for storing all the
required files in their entirety.
2. Block level transfer model 
o In this model, file data transfers across the network between a client and a server take
place in units of file blocks.
o A file block is a contiguous portion of a file and is usually fixed in length.
o In the page level transfer model, the block size is equal to the virtual memory page size.
o Advantages 
= This model does not require client nodes to have large storage. 
= It eliminates the need to copy an entire file when only a small portion of the file data 
is needed. 
= It provides large virtual memory for client nodes that do not have their own secondary 
storage devices. 
o Disadvantage 
= When an entire file is to be accessed, multiple server requests are needed in this
model, resulting in more network traffic and more network protocol overhead.
3. Byte level transfer model 
o In this model, file data transfers across the network between a client and a server take
place in units of bytes.
o Advantage
= This model provides maximum flexibility because it allows storage
and retrieval of an arbitrary sequential subrange of a file, specified by an offset within
a file, and a length.
o Disadvantage
= Cache management is difficult due to variable length data for different requests.
4. Record level transfer model 
o In this model, file data transfers across the network between a client and a server take
place in units of records.
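
The trade-off between the transfer units can be made concrete with a small calculation. This is only a sketch: the function names and the sizes used (a 1 MiB file, 4096-byte blocks) are assumptions chosen for illustration.

```python
import math


def requests_for_whole_file(file_size, unit_size=None):
    """Server requests needed to read a whole file.

    unit_size=None models file level transfer (one request per file).
    """
    if unit_size is None:
        return 1
    return math.ceil(file_size / unit_size)


def bytes_transferred(file_size, offset, length, unit_size=None):
    """Bytes moved over the network to satisfy one read of `length` bytes."""
    if unit_size is None:
        return file_size                      # the entire file is shipped
    first = offset // unit_size               # first unit touched by the read
    last = (offset + length - 1) // unit_size  # last unit touched by the read
    return min((last + 1) * unit_size, file_size) - first * unit_size


FILE_SIZE = 1024 * 1024   # 1 MiB file (assumed)
BLOCK = 4096              # block size (assumed)

whole_file_block_requests = requests_for_whole_file(FILE_SIZE, BLOCK)
small_read_file_level = bytes_transferred(FILE_SIZE, 5000, 100)
small_read_block_level = bytes_transferred(FILE_SIZE, 5000, 100, BLOCK)
small_read_byte_level = bytes_transferred(FILE_SIZE, 5000, 100, 1)
```

Reading the whole file takes one request at file level but many at block level, whereas a small read moves far less data at block or byte level.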
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5. File sharing semantics 
1. UNIX semantics
Figure 5.1(a): Example of UNIX file-sharing semantics
o This semantics enforces an absolute time ordering on all operations.
o Every read operation on a file sees the effects of all previous write operations performed
on that file, as shown in Figure 5.1(a).
Figure 5.1(b): Delay due to network

o This semantics can be achieved in a distributed system by disallowing files to be cached
at client nodes and by allowing a shared file to be managed by only one file server that
processes all read and write requests for the file strictly in the order in which it receives them.

o Even then, there is a possibility that, due to network delays, client requests from different nodes may
arrive and get processed at the server node out of order.

2. Session semantics

o A client opens a file, performs a series of read/write operations on the file, and finally
closes the file when he or she is done with the file.
o A session is a series of file accesses made between the open and close operations.

o In session semantics, all changes made to a file during a session are initially made visible
only to the client process that opened the session and are invisible to other remote
processes that have the same file open simultaneously.

o Once the session is closed, the changes made to the file are made visible to remote
processes only in later starting sessions.

3. Immutable shared file semantics 

o According to this semantics, once the creator of a file declares it to be sharable, the file is 
treated as immutable, so that it cannot be modified any more. 

o Changes to the file are handled by creating a new updated version of the file.

o Therefore, the semantics allows files to be shared only in the read-only mode. 

4. Transaction-like semantics

o This semantics is based on the transaction mechanism, which is a high-level mechanism
for controlling concurrent access to shared mutable data.

o A transaction is a set of operations enclosed between a pair of begin_transaction and
end_transaction like operations.

o The transaction mechanism ensures that the partial modifications made to the shared
data by a transaction will not be visible to other concurrently executing transactions until
the transaction ends.
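
Of the semantics above, session semantics is perhaps the easiest to sketch in code. The following is a minimal illustration, not a real file service: SessionFileServer and Session are hypothetical names, and each session simply works on a private copy that is written back on close.

```python
class SessionFileServer:
    """Holds the master copy of each file (illustrative only)."""

    def __init__(self):
        self.files = {}

    def open(self, name):
        # Each session starts from the contents as of open time.
        return Session(self, name, self.files.get(name, ""))


class Session:
    def __init__(self, server, name, snapshot):
        self.server = server
        self.name = name
        self.data = snapshot          # private copy, invisible to others

    def read(self):
        return self.data

    def write(self, text):
        self.data += text             # change is local to this session

    def close(self):
        # Only now do the changes reach the master copy, where
        # later starting sessions will see them.
        self.server.files[self.name] = self.data


server = SessionFileServer()
server.files["notes"] = "v1"

s1 = server.open("notes")
s2 = server.open("notes")
s1.write("+edit")
seen_by_s2_during = s2.read()
s1.close()
seen_by_s2_after = s2.read()
seen_by_new_session = server.open("notes").read()
```

The already open session s2 keeps seeing the old contents even after s1 closes; only a session opened afterwards sees the edit.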


6. File caching schemes 
6.1. Cache Location 
e Cache location refers to the place where the cached data is stored. 
e Assuming that the original location of a file is on its server's disk, there are three possible
cache locations in a distributed file system.
e Server's Main Memory
o When no caching scheme is used, before a remote client can access a file, the file must
first be transferred from the server's disk to the server's main memory and then across
the network from the server's main memory to the client's main memory.
o Thus the total cost involved is one disk access and one network access.
o A cache located in the server's main memory eliminates the disk access cost on a cache hit.
o Advantages 
= It is easy to implement and is totally transparent to the clients. 
= It is easy to keep the original file and cached data consistent. 
= Since a single server manages both the cached data and the file, multiple accesses 
from different clients can be easily synchronized to support UNIX-like file-sharing 
semantics. 
o Disadvantage 
= It does not eliminate the network access cost and does not contribute to the scalability 
and reliability of the distributed file system. 
e Client's Disk
o A cache located in a client's disk eliminates network access cost but requires disk access
cost on a cache hit.
o Advantages
= If the cached data is kept on the client's disk, the data is still there during recovery and
there is no need to fetch it again from the server's node.
= As compared to a main-memory cache, a disk cache has plenty of storage space.
= Disk cache is useful to those applications that use file level transfer.
o Disadvantages
= This policy does not work if the system is to support diskless workstations.
= Access is slower than with a main-memory cache because every cache hit still requires a disk access.
e Client's Main Memory
o A cache located in a client's main memory eliminates both network access cost and disk
access cost.
o It also permits workstations to be diskless.
o Advantages
= It provides high scalability and reliability as a cache hit can be serviced
locally without the need to contact the server.
o Disadvantage
= Client main memory is small compared to the client's disk.


6.2 Modification Propagation 
1. Write through Scheme 
o In this scheme, when a cache entry is modified, the new value is immediately sent to the
server for updating the master copy of the file.
o Advantages
= It provides a high degree of reliability and is suitable for UNIX-like semantics.
= Since every modification is immediately propagated to the server having the master
copy of the file, the risk of updated data getting lost is very low.
o Disadvantage
= It has poor write performance as each write access has to wait until the information is
written to the master copy at the server.
o This scheme is suitable for use only in those cases in which the ratio of read-to-write 
accesses is fairly large. 
2. Delayed Write 
o In this scheme, when a cache entry is modified, the new value is written only to the cache
and the client just makes a note that the cache entry has been updated.
o After some time, all updated cache entries corresponding to a file are gathered together
and sent to the server in one batch.
o Depending on when the modifications are sent to the file server, delayed-write policies
are of different types.
(a) Write on ejection from cache.
= In this method, modified data in a cache entry is sent to the server when the cache
replacement policy has decided to eject it from the client's cache.
= Some data can reside in the client's cache for a long time before they are sent to the
server.
= Such data are subject to reliability problems, because they are lost if the client crashes first.
(b) Periodic write. 
= In this method, the cache is scanned periodically, at regular intervals, and any cached 
data that have been modified since the last scan are sent to the server. 
(c) Write on close.
= In this method, the modifications made to cached data by a client are sent to the
server when the corresponding file is closed by the client.
o Advantages of Delayed write
= Write accesses complete more quickly because the new value is written only in the
cache of the client performing the write.
= Modified data may be deleted before it is time to send them to the server. In such
cases, modifications need not be propagated to the server at all, resulting in a major
performance gain.
= Gathering all file updates and sending them together to the server is more efficient
than sending each update separately.
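
The difference between the write-through and delayed-write schemes can be sketched by counting the messages a client sends. This is an illustration under simplifying assumptions: CachingClient is a hypothetical class, the server is modelled as a plain dictionary, and the delayed-write variant shown is write on close.

```python
class CachingClient:
    """Client cache with a selectable modification propagation policy."""

    def __init__(self, server, policy):
        self.server = server      # dict standing in for the master copy
        self.policy = policy      # "write-through" or "write-on-close"
        self.cache = {}
        self.dirty = set()        # blocks modified but not yet sent
        self.messages = 0         # update messages sent to the server

    def write(self, block, data):
        self.cache[block] = data
        if self.policy == "write-through":
            self._send(block)             # propagate immediately
        else:
            self.dirty.add(block)         # just note the modification

    def close(self):
        # Delayed write: send every modified block once, in one pass.
        for block in sorted(self.dirty):
            self._send(block)
        self.dirty.clear()

    def _send(self, block):
        self.server[block] = self.cache[block]
        self.messages += 1


through_server, delayed_server = {}, {}
through = CachingClient(through_server, "write-through")
delayed = CachingClient(delayed_server, "write-on-close")
for value in ["a", "b", "c"]:
    through.write(0, value)
    delayed.write(0, value)
messages_before_close = delayed.messages
delayed.close()
```

Three writes to the same block cost three messages with write-through but a single message with write on close, at the price of the server being out of date until the file is closed.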


6.3. Cache Validation Schemes 
e A client's cache entry becomes stale as soon as some other client modifies the data 
corresponding to the cache entry in the master copy of the file. 
e Therefore, it becomes necessary to verify if the data cached at a client node is consistent with 
the master copy. 
e If not, the cached data must be invalidated and the updated version of the data must be 
fetched again from the server. 
e There are two approaches: 
1. Client Initiated Approach 
o In this approach, a client contacts the server and checks whether its locally cached data is
consistent with the master copy.
(a) Checking before every access
= This approach defeats the main purpose of caching because the server has to be
contacted on every access.
= But it is suitable for supporting UNIX-like semantics.
(b) Periodic checking
= In this method, a check is initiated at fixed intervals of time.
(c) Check on file open.
= In this method, a client's cache entry is validated only when the client opens the
corresponding file for use.
= This method is suitable for supporting session semantics.
2. Server Initiated Approach 
o If the frequency of the validity check is high, the client-initiated cache validation approach 
generates a large amount of network traffic and consumes precious server CPU time. 
o In this method, a client informs the file server when opening a file, indicating whether the
file is being opened for reading, writing, or both.
o The file server keeps a record of which client has which file open and in what mode.
o The server keeps monitoring the file usage modes being used by different clients and
reacts whenever it detects a potential for inconsistency.
o When a new client makes a request to open an already open file and the server finds
that the new open mode conflicts with the already open mode, the server can deny the
request, queue the request, or disable caching and switch to the remote service mode of
operation for that particular file by asking all the clients having the file open to remove
that file from their caches.
o Disadvantages
= It violates the traditional client-server model in which servers simply respond to
service request activities by clients.

= It requires that file servers be stateful.

= The client-initiated cache validation approach must still be used along with the
server-initiated approach.

= For example, a client may open a file, cache it, and then close it after use.

= Upon opening it again for use, the cache content must be validated because there is a
possibility that some other client might have subsequently opened, modified and
closed the file.
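
The check-on-open variant of client-initiated validation, including the reopen case from the example above, can be sketched as follows. The use of a per-file version number to detect staleness is an assumption made for this sketch (a modification timestamp would serve equally well), and the class names are hypothetical.

```python
class FileServer:
    """Master copies, each tagged with a version number (assumed scheme)."""

    def __init__(self):
        self.files = {}                       # name -> (version, data)

    def write(self, name, data):
        version = self.files.get(name, (0, None))[0] + 1
        self.files[name] = (version, data)

    def version(self, name):
        return self.files[name][0]            # cheap validity check

    def fetch(self, name):
        return self.files[name]               # full transfer


class Client:
    def __init__(self, server):
        self.server = server
        self.cache = {}                       # name -> (version, data)
        self.fetches = 0

    def open(self, name):
        cached = self.cache.get(name)
        # Validate the cache entry only when the file is opened.
        if cached is None or cached[0] != self.server.version(name):
            self.cache[name] = self.server.fetch(name)
            self.fetches += 1
        return self.cache[name][1]


server = FileServer()
server.write("f", "old contents")
client = Client(server)
first_open = client.open("f")
second_open = client.open("f")                # still valid, no refetch
fetches_after_two_opens = client.fetches
server.write("f", "new contents")             # some other client modifies f
third_open = client.open("f")                 # stale entry is replaced
```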


7. File replication 
7.1 Difference between replication and caching 


A replica is associated with a server, whereas a cached copy is normally associated with a 
client. 

The existence of a cached copy is primarily dependent on the locality in file access patterns, 
whereas the existence of a replica normally depends on availability and performance 
requirements. 

As compared to a cached copy, a replica is more persistent, widely known, secure, available, 
complete and accurate. 

A cached copy is contingent upon a replica. Only by periodic revalidation with respect to a 
replica can a cached copy be useful. 


7.2 Advantages of Replication 


e Increased availability.

o The system remains operational and available to the users despite failures.

o By replicating critical data on servers with independent failure modes, the probability that
at least one copy of the data will be accessible increases.

e Increased reliability.

o Replication allows the existence of multiple copies of a file.

o Due to the presence of redundant information in the system, recovery from failures 
becomes possible. 

e Improved response time.

o It enables data to be accessed either locally or from a node to which access time is lower
than the primary copy access time.
e Reduced network traffic

o If a file's replica is available with a file server that resides on a client's node, the client's
access requests can be serviced locally, resulting in reduced network traffic.
e Improved system throughput

o Replication also enables several clients' requests for access to the same file to be serviced
in parallel by different servers, resulting in improved system throughput.
e Better scalability. 

o By replicating the file on multiple servers, the same requests can now be serviced more 
efficiently by multiple servers due to workload distribution. 

o This results in better scalability.

e Autonomous operation.

o In a distributed system that provides file replication as a service to its clients, all files
required by a client for operating during a limited time period may be replicated on the
file server residing at the client's node.

o This will facilitate temporary autonomous operation of client machines.


7.3. Replication Transparency 
e Replication of files should be designed to be transparent to the users so that multiple copies 
of a replicated file appear as a single logical file to its users. 
1. Naming of Replicas 
o Assignment of a single identifier to all replicas of an object seems reasonable for 
immutable objects because a kernel can easily support this type of object. 
o All copies are immutable and identical; there is only one logical object with a given identifier.
o However, for mutable objects, different copies of a replicated object may not be the same
(consistent) at a particular instant of time.
o In this case, if the same identifier is used for all replicas of the object, the kernel cannot 
decide which replica is the most up-to-date one. 
o Therefore, the consistency control and management of the various replicas of a mutable 
object should be performed outside the kernel. 
o It is the responsibility of the naming system to map a user-supplied identifier into the 
appropriate replica of a mutable object. 
2. Replication control 
o Replication control includes determining the number and locations of replicas of a 
replicated file. 
o Depending on whether replication control is user transparent or not, the replication 
process is of two types : 
1. Explicit replication. 
= In this type, users are given the flexibility to control the entire replication process. 
= That is, when a process creates a file, it specifies the server on which the file should 
be placed. 
= Then, if desired, additional copies of the file can be created on other servers on explicit 
request by the users. 
= Users also have the flexibility to delete one or more replicas of a replicated file.
2. Implicit/lazy replication.

= In this type, the entire replication process is automatically controlled by the system
without users' knowledge.

= That is, when a process creates a file, it does not provide any information about its
location.

= The system automatically selects one server for the placement of the file.

= Later the system automatically creates replicas of the file on other servers, based on 
some replication policy used by the system. 


7.4 Multicopy Update Problem 
e As soon as a file system allows multiple copies of the same (logical) file to exist on different 
servers, it is faced with the problem of keeping them mutually consistent. 
1. Read Only Replication 

o This approach allows the replication of only immutable files. 

o Files that are known to be frequently read but modified only once in a while, such as files
containing the object code of system programs, can be treated as immutable files for
replication using this approach.

2. Read Any Write All Protocol 

o In this method a read operation on a replicated file is performed by reading any copy of 
the file and a write operation by writing to all copies of the file. 

o Before updating any copy, all copies are locked, then they are updated, and finally the 
locks are released to complete the write. 

3. Available Copies Protocol 

o This method allows write operations to be carried out even when some of the servers 
having a copy of the replicated file are down. 

o In this method, a read operation is performed by reading any available copy, but a write
operation is performed by writing to all available copies.

o When a server recovers after a failure, it brings itself up to date by copying from other 
servers before accepting any user request. 

o Failed servers are dynamically detected by high-priority status management routines and 
configured out of the system while newly recovered sites are configured back in. 

4. Primary Copy Protocol 

o In this protocol, for each replicated file, one copy is designated as the primary copy and
all the others are secondary copies.

o Read operations can be performed using any copy, primary or secondary. 

o But all write operations are directly performed only on the primary copy. 

o Each server having a secondary copy updates its copy either by receiving notification of 
changes from the server having the primary copy or by requesting the updated copy from 
it. 

5. Quorum Based Protocol 

o Suppose that there are a total of n copies of a replicated file F. 

o To read the file, a minimum of r copies of F have to be consulted.

o This set of r copies is called a read quorum.
Figure 5.2: Example of quorum algorithm (a) n=8, r=4, w=5 (b) n=8, r=2, w=7
o To perform a write operation on the file, a minimum of w copies of F have to be written.
o This set of w copies is called a write quorum.
o The restriction on the choice of the values of r and w is that the sum of the read and write
quorums must be greater than the total number of copies n (r + w > n).
o This ensures that there is at least one common copy of the file between every pair of read and write
operations, resulting in at least one up-to-date copy in any read/write quorum.
o The version number of a copy is updated every time the copy is modified.
o A copy with the largest version number in a quorum is current.
o A read is executed as follows:
= Retrieve a read quorum (any r copies) of F.
= Of the r copies retrieved, select the copy with the largest version number.
= Perform the read operation on the selected copy.
o A write is executed as follows:
= Retrieve a write quorum (any w copies) of F. 
= Of the w copies retrieved, get the version number of the copy with the largest version
number.
= Increment the version number.
= Write the new value and the new version number to all the w copies of the write
quorum.
o Read-any-write-all protocol.
= The read-any-write-all protocol is actually a special case of the generalized quorum
protocol with r=1 and w=n.
= This protocol is suitable for use when the ratio of read to write operations is large.
o Read-all-write-any protocol.
= For this protocol r=n and w=1. This protocol may be used in those cases where the
ratio of write to read operations is large.
o Majority-consensus protocol
= In this protocol, the sizes of both the read quorum and the write quorum are made
either equal or nearly equal.
= For example, if n=11, a possible quorum assignment for this protocol will be r=6 and
w=6.


o Consensus with weighted voting.


= In this approach, each copy is assigned some number of votes; a read quorum of r votes is
collected to read a file and a write quorum of w votes to write a file.

= Since the votes assigned to each copy are not the same, the size of a read/write
quorum depends on the copies selected for the quorum.

= The number of copies in the quorum will be less if the number of votes assigned to
the selected copies is relatively more.

= On the other hand, the number of copies in the quorum will be more if the number
of votes assigned to the selected copies is relatively less.

= To guarantee nonnull intersection between every read quorum and every write
quorum, the values of r and w are chosen such that r+w is greater than the total
number of votes (v) assigned to the file (r+w>v).

= Here, v is the sum of the votes of all the copies of the file.
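
The quorum read and write steps described above, including weighted voting, can be sketched as follows. This is an illustration, not a complete protocol: the class names are hypothetical, quorum members are picked at random, locking and failures are ignored, and the example values are chosen so that write quorums also overlap each other (2w > v), which the version-number step quietly relies on. With one vote per copy, v equals n and the sketch reduces to the unweighted protocol.

```python
import random


class Replica:
    def __init__(self, votes=1):
        self.votes = votes
        self.version = 0
        self.value = None


class QuorumFile:
    def __init__(self, replicas, r, w):
        total_votes = sum(rep.votes for rep in replicas)
        if r + w <= total_votes:
            raise ValueError("need r + w > v so that quorums intersect")
        self.replicas, self.r, self.w = replicas, r, w

    def _gather(self, needed):
        """Collect replicas in random order until their votes reach `needed`."""
        quorum, votes = [], 0
        for rep in random.sample(self.replicas, len(self.replicas)):
            quorum.append(rep)
            votes += rep.votes
            if votes >= needed:
                return quorum
        raise RuntimeError("needed votes exceed the total available")

    def read(self):
        quorum = self._gather(self.r)
        # The copy with the largest version number is current.
        return max(quorum, key=lambda rep: rep.version).value

    def write(self, value):
        quorum = self._gather(self.w)
        new_version = max(rep.version for rep in quorum) + 1
        for rep in quorum:
            rep.version, rep.value = new_version, value


# Figure 5.2(a): n=8, r=4, w=5, one vote per copy.
plain = QuorumFile([Replica() for _ in range(8)], r=4, w=5)
# Weighted voting: votes 3,1,1,1,1 (v=7), r=3, w=5 (assumed values).
weighted = QuorumFile([Replica(v) for v in (3, 1, 1, 1, 1)], r=3, w=5)
plain.write("x")
weighted.write("y")
```

Because every read quorum shares at least one copy with the latest write quorum, plain.read() and weighted.read() return the last written value regardless of which copies are picked.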


8. Fault Tolerance

e The primary file properties that directly influence the ability of a distributed file system to
tolerate faults are as follows:

e Availability 

o Availability of a file refers to the fraction of time for which the file is available for use. 

o If a network is partitioned due to a communication link failure, a file may be available to 
the clients of some nodes, but at the same time, it may not be available to the clients of 
other nodes. 

e Robustness

o Robustness of a file refers to its power to survive crashes of the storage device and
decays of the storage medium on which it is stored.

o Storage devices that are implemented by using redundancy techniques, such as a
stable storage device, are often used to store robust files.

e Recoverability

o Recoverability of a file refers to its ability to be rolled back to an earlier, consistent state
when an operation on the file fails or is aborted by the client.
8.1 Stable Storage


1. Volatile Storage 
o Example is RAM. 
o It cannot withstand power failure or machine crash. 
o Data is lost in case of power failure or machine crash. 
2. Nonvolatile storage 
o Example is disk. 
o It can withstand CPU failures but cannot withstand transient I/O faults and decay of the
storage media.
3. Stable Storage
o It can withstand transient I/O faults and decay of the storage media.


Nirali C Dutiya, CE Department | 2160710 - Distributed Operating System
Darshan | Unit 5 - Distributed File System


o The basic idea of stable storage is to use duplicate storage devices to implement a stable
device.

o A disk-based stable storage system consists of a pair of ordinary disks (disk 1 and disk 2)
that are assumed to be decay independent.

o Each block on disk 2 is an exact copy of the corresponding block on disk 1.

o A read operation first attempts to read disk 1; if that fails, disk 2 is read.

o A write operation writes to both disks, but the write to disk 2 does not start until that for disk
1 has been successfully completed.

o After a crash, the recovery action compares the contents of the two disks block by block.

o Whenever two corresponding blocks differ, the block having incorrect data is regenerated
from the corresponding block on the other disk.
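
The two-disk scheme above can be sketched in a few lines. This is a simulation under stated assumptions: Disk and StableStorage are hypothetical classes, a decayed block is modelled as one that raises an error when read, and when both blocks are readable but differ the sketch assumes disk 1 holds the newer data because it is always written first.

```python
class Disk:
    """In-memory stand-in for an ordinary disk."""

    def __init__(self, nblocks):
        self.blocks = [b""] * nblocks
        self.bad = set()                  # blocks that fail to read (decay)

    def read(self, n):
        if n in self.bad:
            raise OSError(f"block {n} unreadable")
        return self.blocks[n]

    def write(self, n, data):
        self.blocks[n] = data
        self.bad.discard(n)               # rewriting repairs the block


class StableStorage:
    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.disk1, self.disk2 = Disk(nblocks), Disk(nblocks)

    def write(self, n, data):
        self.disk1.write(n, data)         # must complete first
        self.disk2.write(n, data)         # only then is disk 2 written

    def read(self, n):
        try:
            return self.disk1.read(n)
        except OSError:
            return self.disk2.read(n)     # fall back to the copy

    def recover(self):
        """Compare the disks block by block after a crash."""
        for n in range(self.nblocks):
            try:
                good = self.disk1.read(n)
            except OSError:
                # Disk 1 block is unreadable: regenerate it from disk 2.
                self.disk1.write(n, self.disk2.read(n))
                continue
            try:
                other = self.disk2.read(n)
            except OSError:
                other = None
            if other != good:
                # Assumption: disk 1 was written first, so it is newer.
                self.disk2.write(n, good)


storage = StableStorage(4)
storage.write(0, b"data")
storage.disk1.bad.add(0)                  # simulate decay on disk 1
read_despite_decay = storage.read(0)
storage.disk1.write(1, b"new")            # simulate a crash between the two writes
storage.recover()
```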


8.2 Effect of Service Paradigm on Fault Tolerance 
e A server may be implemented by using any one of the following two service paradigms- 
stateful and stateless. 
1. Stateful File Servers 

o A stateful server maintains client’s state information from one remote procedure call to 
the next. 

o In case of two subsequent calls by a client to a stateful server, some state information
pertaining to the service performed for the client as a result of the first call execution is
stored by the server process.

o This client state information is subsequently used at the time of executing the second
call.
Figure 5.3: Stateful server. The client process issues Open(filename, mode) and Read(fid, 100, buf) calls; the server process answers Return(fid), Return(bytes 0 to 99) and Return(bytes 100 to 199).
o For example, let us consider a server for byte-stream files that allows the following
operations on files, as illustrated in Figure 5.3.
o Open(filename,mode) 
= This operation is used to open a file identified by filename in the specified mode. 
= When the server executes this operation, it creates an entry for this file in the file table
that it uses for maintaining the file state information of all the open files.
= The file state information normally consists of the identifier of the file, the open mode,
and the current position of a nonnegative integer pointer, called the read-write
pointer.
= When a file is opened, its read-write pointer is set to zero and the server returns to the
client a file identifier (fid), which is used by the client for subsequent accesses to that
file.
o Read(fid,n,buffer) 
= This operation is used to get n bytes of data from the file identified by fid into the 
buffer named buffer. 
= When the server executes this operation, it returns to the client n bytes of file data
starting from the byte currently addressed by the read-write pointer and then
increments the read-write pointer by n.
o Write(fid,n,buffer)
= On execution of this operation, the server takes n bytes of data from the specified
buffer, writes it into the file identified by fid at the byte position currently addressed by
the read-write pointer, and then increments the read-write pointer by n.
o Seek(fid,position) 
= This operation causes the server to change the value of the read-write pointer of the 
file identified by fid to the new value specified as position. 
o Close(fid) 
= This operation causes the server to delete from its file table the file state information
of the file identified by fid.
2. Stateless File Server
o A stateless server does not maintain any client state information.
Figure 5.4: Stateless server. The figure shows the file state information, the client process calls Read(filename, 0, 100, buf) and Read(filename, 100, 100, buf), and the server process replies Return(bytes 0 to 99) and Return(bytes 100 to 199).

o Every request from a client is accompanied by all the parameters necessary to successfully
carry out the desired operation.

o For example, a server for byte-stream files that allows the following operations on files is
stateless.

o Read(filename,position,n,buffer)
= On execution of this operation, the server returns to the client n bytes of data of the
file identified by filename.
= The returned data is placed in the buffer named buffer.

= The actual number of bytes read is also returned to the client.

= The position within the file from where to begin reading is specified as the position
parameter.

o Write(filename,position,n,buffer) 

= When the server executes this operation, it takes n bytes of data from the specified 
buffer and writes it into the file identified by filename. 

= The position parameter specifies the byte position within the file from where to start 
writing. 

= The server returns to the client the actual number of bytes written.

o If a client wishes to have a similar effect as that in Figure 5.3, the following two read
operations must be carried out:

Read(filename,0,100,buf)

Read(filename,100,100,buf)
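
The contrast between the two paradigms can be sketched side by side. This is an illustration only: the class names are hypothetical, files are held in memory, the buffer parameter is replaced by a return value, and a crash is simulated by simply creating a fresh server object over the same files.

```python
import itertools


class StatefulFileServer:
    """Keeps a file table, so each request can be short."""

    def __init__(self, files):
        self.files = files                    # filename -> bytes
        self.table = {}                       # fid -> [filename, mode, pointer]
        self._fids = itertools.count(1)

    def open(self, filename, mode):
        fid = next(self._fids)
        self.table[fid] = [filename, mode, 0]  # read-write pointer starts at zero
        return fid

    def read(self, fid, n):
        entry = self.table[fid]               # KeyError if the table was lost
        filename, _, pos = entry
        data = self.files[filename][pos:pos + n]
        entry[2] = pos + len(data)            # advance by the bytes returned
        return data

    def seek(self, fid, position):
        self.table[fid][2] = position

    def close(self, fid):
        del self.table[fid]


class StatelessFileServer:
    """Every request carries the file name and position itself."""

    def __init__(self, files):
        self.files = files

    def read(self, filename, position, n):
        return self.files[filename][position:position + n]


files = {"f": bytes(range(200))}

stateful = StatefulFileServer(files)
fid = stateful.open("f", "r")
first = stateful.read(fid, 100)
second = stateful.read(fid, 100)

stateless = StatelessFileServer(files)
same_first = stateless.read("f", 0, 100)
stateless = StatelessFileServer(files)        # simulated crash and restart
same_second = stateless.read("f", 100, 100)   # the client simply carries on
```

A restarted stateful server would no longer know fid, since its file table is gone, whereas the stateless client is unaffected by the restart. This difference is what ties the choice of service paradigm to fault tolerance.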


9. Trends in Distributed File System 
9.1 New Hardware 
e Currently, all file servers use magnetic disks for storage. 
e Main memory is often used for server caching. 
e Within a few years, memory may become so cheap that the file system may permanently reside
in memory, and no disks will be needed.
e Most current file systems organize files as a collection of blocks, either as a tree or as a linked 
list. 
e With an in-core file system, it may be much simpler to store each file contiguously in memory, 
rather than breaking it up into blocks. 
e Contiguously stored files are easier to keep track of and can be shipped over the network
faster.

Figure 5.5: A hardware scheme for updating shared files (fiber optic ring; bit map in each network interface, one bit per cached file; file cache; table giving the location of each cached file).


e Main memory file servers introduce a serious problem, however. If the power fails, all the 
files are lost. 

e A hardware development that may affect file systems is the optical disk. 

e Optical disks are sometimes referred to as WORM (Write Once Read Many) devices.
e Another interesting hardware development is very fast fiber optic networks.

e Since transmission is very fast in fiber optics, there may be no need for client caching.

e Consider the system of Fig. 5.5 in which each network interface has a bit map, one bit per 
cached file. 

e To modify a file, a processor sets the corresponding bit in the interface, which is 0 if no 
processor is currently updating the file. 

e Setting a bit causes the interface to create and send a packet around the ring that checks and 
sets the corresponding bit in all interfaces. 

e If the packet makes it all the way around without finding any other machines trying to use the
file, some other register in the interface is set to 1.

e Otherwise, it is set to 0.

e This mechanism provides a way to globally lock the file on all machines in a few microseconds.

e After the lock has been set, the processor updates the file. Each block of the file that is
changed is noted.

e When the update is complete, the processor clears the bit in the bit map, which causes the 
network interface to locate the file using a table in memory and automatically deposit all the 
modified blocks in their proper locations on the other machines. 

e When the file has been updated everywhere, the bit in the bit map is cleared on all machines. 


9.2 Scalability 

e A definite trend in distributed systems is toward larger and larger systems.

e Algorithms that work well for systems with 100 machines may work poorly for systems with 
1000 machines and not at all for systems with 10,000 machines. 

e If opening a file requires contacting a single centralized server to record the fact that the file 
is open, that server will eventually become a bottleneck as the system grows. 

e A general way to deal with this problem is to partition the system into smaller units and try 
to make each one relatively independent of the others. 

e Having one server per unit scales much better than a single server. 

e Broadcasts are another problem area. 

e If each of n machines issues one broadcast per second, a total of n broadcasts per
second appear on the network; since every broadcast interrupts every machine, this
generates a total of n^2 interrupts per second.

e Algorithms and data structures whose cost grows linearly with system size scale poorly. In
contrast, hash tables are acceptable, since the access time is more or less constant, almost
independent of the number of entries.

e Strict semantics, such as UNIX semantics, get harder to implement as systems get bigger.
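
The broadcast arithmetic above can be checked with a short calculation. The function name is hypothetical, and the sketch assumes that every broadcast interrupts every machine on the network.

```python
def broadcast_load(n, broadcasts_per_machine_per_second=1):
    """Return (broadcasts per second, interrupts per second) for n machines."""
    broadcasts = n * broadcasts_per_machine_per_second
    interrupts = broadcasts * n       # each broadcast interrupts all n machines
    return broadcasts, interrupts


small_system = broadcast_load(100)
large_system = broadcast_load(10_000)
```

Going from 100 to 10,000 machines multiplies the number of broadcasts by 100 but the number of interrupts by 10,000.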


9.3. Wide Area Networking 
e Most current work on distributed systems focuses on LAN-based systems. 
e In the future, many LAN-based distributed systems will be interconnected to form transparent
distributed systems covering countries and continents.
e An inherent problem with massive distributed systems is that the network bandwidth is
extremely low.
e If the telephone line is the main connection, getting more than 64 Kbps out of it seems
unlikely.
e Bringing fiber optics into everyone's house will take decades and cost billions.


9.4 Mobile Users 


A large fraction of the time, the user is off-line, disconnected from the file system. 

The solution will probably have to be based on caching.

While connected, the user downloads to the portable those files expected to be needed later. 
These are used while disconnected. 

When the user reconnects, the files in the cache will have to be merged with those in the file
tree.

Since a disconnection can last for hours or days, the problems of maintaining cache consistency
are much more severe than in online systems.


9.5 Fault Tolerance 


As distributed systems become more and more widespread, the demand for systems that 
essentially never fail will grow. 

File replication, often an afterthought in current distributed systems, will become an essential 
requirement in future ones. 

Systems will also have to be designed that manage to function when only partial data are 
available, since insisting that all the data be available all the time does not lead to fault 
tolerance. 

Downtimes that are now considered acceptable by programmers and other sophisticated 
users will be increasingly unacceptable as computer use spreads to nonspecialists. 


9.6 Multimedia 


Text files are rarely more than a few megabytes long, but video files can easily exceed a 
gigabyte. 
To handle applications such as video-on-demand, completely different file systems will be 
needed. 


10. Case Study : DCE Distributed File Service 


DCE's DFS has several attractive features and promises to play a major role in future 
distributed computing environments. 

It is derived from the CMU Andrew File System (AFS) but possesses many new features. 

DFS is basically a DCE application that makes use of other DCE services. 

DFS has been designed to allow multiple file systems to simultaneously exist on a node of a 
DCE system. 

For example, UNIX, NFS and DFS's own file system can coexist on a node to provide three 
different types of file systems to the users of that node. 

The local file system of DFS that provides DFS on a single node is called Episode. 
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e As may be expected, when multiple file systems coexist on a node, features specific to DFS 
are available only to those users who use the Episode file system. 


10.1 DFS File Model 

e Like UNIX, DFS uses the unstructured file model, in which a file is an unstructured sequence of 
data. 

e A DFS file server handles the contents of a file as an uninterrupted sequence of bytes. 

e A single file can contain up to 2^42 bytes. 

e Like UNIX, DFS also uses the mutable file model. 

e It also has a facility called cloning that allows two stored versions of a file to exist simultaneously. 

e One of these is the version of the file before cloning and the other contains the changes made 
to the file after cloning. 


10.2 DFS File System Model 
e As shown in Figure 5.6, the DFS file system model has four levels of aggregation. 
e At the lowest level are individual files. At the next level are directories. 
e Each directory usually contains several files. 


[Figure: an aggregate (one per disk partition) contains filesets, each fileset contains directories, and each directory contains files.] 


Figure 5.6 Four levels of aggregation in the DFS file system model 
e Above directories are filesets; each fileset usually contains several directories. 
e At the highest level are aggregates; each aggregate usually contains multiple filesets, and each 
disk partition holds exactly one aggregate. 
e A fileset may contain all the files of a single user, or all the files of a group of related users. 
e DFS allows multiple filesets per disk partition. 


10.3 DFS File-Accessing Model 
e Distributed File Service relies on a client-server architecture and uses the data-caching model 
for file accessing. 
e A machine in a DCE system is a DFS client, a DFS server, or both. A DFS client is a machine 
that uses files in filesets managed by DFS servers. 
e The main software component of a DFS client machine is the DFS cache manager, which 
caches parts of recently used files to improve performance. 
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e A DFS server is a machine having its own disk storage that manages files in the filesets stored 
on its local disk and services requests received from DFS clients. Its main software components 
are the following: 
o Episode: This is the DFS local file system. 
o Token manager: The token manager is used to implement the token-based approach for 
handling multicache consistency problems. 
o File exporter. 
= The file exporter accepts file access requests from clients and returns replies to them. 
= The interaction between clients and the file exporter is done using DCE RPC. 
= It also handles client authentication for establishing secure communication channels 
between clients and the DFS server. 
= The file exporter is multithreaded so that several client requests can be handled 
simultaneously. 
o Fileset server. 
= The fileset server manages the local filesets. 
= It keeps track of how many filesets are there and which fileset belongs to which disk 
partition. 
= It also provides commands that can be used by the system administrator to obtain 
fileset information, to manipulate disk quotas of filesets, and to create, delete, 
duplicate, move, backup, clone, or restore an entire fileset. 
o Fileset location server. 
= The fileset location server keeps information about which DFS servers are managing 
which filesets in the cell. 
= If a fileset is moved from one DFS server to another or is replicated on another DFS 
server, the fileset location server records these changes. 
= Given the name of a file, the fileset location server returns the address of the DFS 
server that manages the fileset that contains the file. 
= If a fileset is replicated, the addresses of all the DFS servers that manage it are returned. 
= When a DFS client accesses a file by specifying its name, its cache manager gets the 
address of the DFS server that manages the fileset of the file from the fileset location 
server. 
o Replication server. 
= The replication server maintains the consistency of the replicas of filesets. 
e The first three components reside in the kernel space and the latter three reside in the user 
space. 
e DFS uses the block-level transfer model for the unit of data transfer. 
e The block size is 64 kilobytes, so many files will be transferred and cached in their entirety. 


10.4 DFS Sharing Semantics 
e A DFS server has a component called the token manager. 
e The job of the token manager is to issue tokens to clients for file requests and to keep track 
of which clients have been issued what types of tokens for which files. 
e A client cannot perform the desired file operation on a piece of file data until it possesses the 
proper token. 
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Figure 5.7 : Token based DFS for implementing the UNIX file sharing 

e The use of token-based approach to implement the single-site UNIX file-sharing semantics 
can best be illustrated with the help of an example. 

e Figure 5.7(a) shows the initial state of a server and two client machines. 

e At this time, client A makes a request to the server for accessing file F1. 

e The server checks its state information to see if the token for file F1 has been given to any 
other client. 

e It finds that it has not been given to any other client, so it sends the token and data of file F1 
to client A and makes a record of this information for future use. 

e Client A caches the received file data and then continues to perform file access operations on 
this data as many times as needed. 

e Figure 5.7(b) shows the state of the server and client machines after client A receives the 
token and data for file F1. 

e Now suppose after some time that client B makes a request to the server for accessing the 
same file F1. 

e The server checks its state information and finds that the token for file F1 has been given to 
client A. 

e Therefore, it does not immediately send the token and data for file F1 to client B. 

e It first sends a message to client A asking it to return the token for file F1. 

e On receiving the revocation message, client A returns the token along with the updated data of 
file F1 and invalidates its cached data for file F1. 

e The server then updates its local copy of file F1 and now returns the token and up-to-date 
data of file F1 to client B. 
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Client B caches the received file data and then continues to perform file access operations on 
this data as many times as needed. 

Figure 5.7(c) shows the state of the server and client machines after client B receives the 
token and data for file F1. 

In this way, the single-system UNIX file-sharing semantics is achieved because the server 
issues the token for a file to only one client at a time. 
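
The token exchange described above can be sketched in a few lines of Python. This is only an illustrative model, not DFS code: the class names TokenServer and Client, the method names, and the single whole-file token are assumptions made for the example, and real DFS tokens are typed and issued over DCE RPC.

```python
class TokenServer:
    """Toy server: one token per file, held by at most one client at a time."""

    def __init__(self, files):
        self.files = dict(files)   # server's copy of each file
        self.holder = {}           # file name -> client currently holding the token

    def request(self, client, name):
        current = self.holder.get(name)
        if current is not None and current is not client:
            # Revoke: the holder returns its updated data and drops its cache.
            self.files[name] = current.revoke(name)
        self.holder[name] = client
        return self.files[name]


class Client:
    def __init__(self, label, server):
        self.label = label
        self.server = server
        self.cache = {}            # file name -> cached data (valid while token is held)

    def read(self, name):
        if name not in self.cache:
            self.cache[name] = self.server.request(self, name)
        return self.cache[name]

    def write(self, name, data):
        self.read(name)            # obtain the token and data first
        self.cache[name] = data    # the update stays local until revocation

    def revoke(self, name):
        return self.cache.pop(name)   # hand back updated data, invalidate cache


server = TokenServer({"F1": "v0"})
a, b = Client("A", server), Client("B", server)
a.write("F1", "v1 from A")
result = b.read("F1")
print(result)
```

When client B asks for F1, the server first revokes the token from client A, receives A's updated data, and only then passes the token and data to B, so B never sees a stale copy.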


10.5 File-Caching Scheme in DFS 


In DFS recently accessed file data are cached by the cache manager of client machines. The 
local disk of a client machine is used for this purpose. 

In DFS, modifications made to cached file data are propagated to the file server only when 
the client receives a token revocation message for the file data. 

Cached file data of a client machine is invalidated only when the client receives a token 
revocation message for the file data from the file server. 

As long as the client possesses the token for the specified operation on the file data, the 
cached data is valid and the client can continue to perform the specified operation on it. 

The approach used for cache validation is a server-initiated approach. 


10.6 Replication and Cloning in DFS 


Distributed File Service provides the facility to replicate files on multiple file servers. 

The unit of replication is a fileset. 

A filename is mapped to all file servers having a replica of the file. 

Given a filename, the fileset location server returns the addresses of all the file servers that 
have a replica of the fileset containing the file. 

DFS uses the explicit replication mechanism for replication control. 

The replication server of a server machine is responsible for maintaining the consistency of 
the replicas of filesets. 

The primary-copy protocol is used for this purpose. 

DFS also provides the facility to clone filesets. 

This facility allows the creation of a new virtual copy of the fileset in another disk partition, 
while the old copy is marked read-only. 

The system administrator might instruct the system to clone a fileset every day at midnight 
so that the previous day's work always remains intact in an old version of all files. 

If a user inadvertently deletes a file, he or she can always get the old version of the file and 
once again perform the current day's updates on it. 


10.7 Fault Tolerance 


In DFS, for every update made to a file, a log is written to the disk. 

A log entry contains the old value and the new value of the modified part of the file. 

When the system comes up after a crash, the log is used to check which changes have already 
been made to the file and which changes have not yet been made. 

Those that have not been made are the ones that were lost due to system crash. 
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e These changes are now made to the file to bring it to a consistent state. 

e If an update is lost because the crash occurred before the log for the update was recorded on 
the disk, it does not create any inconsistency because the lost update is treated as if the 
update was never performed on the file. 

e Therefore the file is always in a consistent state after recovery. 
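
The recovery step can be illustrated with a small Python sketch. It is a simplified model under stated assumptions: the function name recover, the (block number, old value, new value) layout of a log entry, and the list standing in for a file are all hypothetical, and the comparison against the old value assumes that values in a block do not repeat.

```python
def recover(file_blocks, log):
    """Replay a log after a crash.

    Each log entry is (block_no, old_value, new_value). If the block still
    holds the old value, the change was logged but not yet applied, so it is
    applied now. If the block already holds the new value, nothing is done.
    """
    for block_no, old, new in log:
        if file_blocks[block_no] == old:
            file_blocks[block_no] = new   # redo the change lost in the crash
    return file_blocks


blocks = ["a1", "b0", "c0"]                  # file contents found after the crash
log = [(0, "a0", "a1"), (1, "b0", "b1")]     # entries that reached the disk
recovered = recover(blocks, log)
print(recovered)
```

Block 0 already holds its new value and is left alone, block 1 is brought forward from the log, and block 2, whose update (if any) never reached the log, is treated as if the update had never been performed.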

e DFS servers are stateful because they have to keep track of the tokens issued to the clients. 


10.8 Atomic Transaction 

e DCE does not provide a transaction processing facility either as a part of DFS or as an 
independent component. 

e This is mainly because DCE currently does not possess services needed for developing and 
running mission critical, distributed on-line transaction processing (OLTP) applications. 

e Users of DCE who need a transaction processing facility can use Transarc Corporation's Encina 
OLTP technology. 

e Encina expands on the DCE framework and provides a set of standards-based distributed 
services for simplifying the construction of reliable, distributed OLTP systems with guaranteed 
data integrity. 

e In addition to Encina, two other transaction processing environments in use are the Customer 
Information Control System (CICS) and the Information Management System (IMS). 


10.9 User Interfaces to DFS 
e File service interface. 
o DFS uses native operating system commands for directory and file operations so that 
users do not need to learn new commands. 
o DFS also has several commands that work only for its own file system (Episode). 
o These include commands to check quotas of different filesets, to locate the server of a 
file, and so on. 
o To access a file, a client may specify the file by its global name or by its cell-relative name. 
e Application programming interface. 
o The application programming interface to DFS is very similar to UNIX. 
o Application programmers can use standard file system calls such as fopen() to open a 
file, fread() to read from a file, fwrite() to write to a file, and fclose() to close a file. 
e Administrative interface. 
o The administrative interface of DFS provides commands that allow the system 
administrator to handle filesets, to install or remove DFS file servers and to manipulate 
ACLs associated with files and directories. 
o The commands for handling filesets are used by the system administrator to create, 
delete, move, replicate, clone, back up, or restore filesets. 
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1. General Architecture of DSM 


e The DSM systems normally have an architecture of the form shown in Figure 6.1. 


[Figure: nodes, each with a memory-mapping manager, connected by a communication network; the distributed shared memory exists only virtually.] 


Figure 6.1: Distributed Shared Memory Architecture 

e Each node of the system consists of one or more CPUs and a memory unit. 

e The nodes are connected by a high-speed communication network. 

e A simple message-passing system allows processes on different nodes to exchange messages 
with each other. 

e The DSM abstraction presents a large shared-memory space to the processors of all nodes. 

e In contrast to the shared physical memory in tightly coupled parallel architectures, the shared 
memory of DSM exists only virtually. 

e A software memory-mapping manager routine in each node maps the local memory onto the 
shared virtual memory. 

e To facilitate the mapping operation, the shared-memory space is partitioned into blocks. 

e Data caching is a well-known solution to address memory access latency. 

e The idea of data caching is used in DSM systems to reduce network latency. 

e The main memory of individual nodes is used to cache pieces of the shared-memory space. 

e The memory-mapping manager of each node views its local memory as a big cache of the 
shared-memory space for its associated processors. 

e The basic unit of caching is a memory block. 

e When a process on a node accesses some data from a memory block of the shared-memory 
space, the local memory-mapping manager takes charge of its request. 

e If the memory block containing the accessed data is resident in the local memory, the request 
is satisfied by supplying the accessed data from the local memory. 

e Otherwise, a network block fault is generated and the control is passed to the operating 
system. 
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The operating system then sends a message to the node on which the desired memory block is 
located to get the block. 

The missing block is migrated from the remote node to the client process's node and the 
operating system maps it into the application's address space. 

The faulting instruction is then restarted and can now complete. 

Therefore, the scenario is that data blocks keep migrating from one node to another on 
demand but no communication is visible to the user processes. 

That is, to the user processes, the system looks like a tightly coupled shared-memory 
multiprocessor system in which multiple processes freely read and write the shared memory 
at will. 

Copies of data cached in local memory eliminate network traffic for a memory access on cache 
hit, that is, access to an address whose data is stored in the cache. 

Network traffic is significantly reduced if applications show a high degree of locality of data 
accesses. 
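
The block-fault path described above can be modelled with a short Python sketch. It is a toy under stated assumptions: the Node class, the BLOCK_SIZE of 4 words, and the shared directory dictionary (standing in for whatever block-locating mechanism a real DSM system uses) are hypothetical, and everything runs in one process rather than over a network.

```python
BLOCK_SIZE = 4  # words per block (hypothetical value)


class Node:
    def __init__(self, name, directory):
        self.name = name
        self.local = {}              # block number -> list of words held locally
        self.directory = directory   # shared map: block number -> owning Node
        self.faults = 0              # count of network block faults

    def _block(self, addr):
        number = addr // BLOCK_SIZE
        if number not in self.local:                      # network block fault
            self.faults += 1
            owner = self.directory[number]
            self.local[number] = owner.local.pop(number)  # migrate the block here
            self.directory[number] = self
        return self.local[number]

    def read(self, addr):
        return self._block(addr)[addr % BLOCK_SIZE]

    def write(self, addr, value):
        self._block(addr)[addr % BLOCK_SIZE] = value


directory = {}
n1, n2 = Node("n1", directory), Node("n2", directory)
for b in range(2):                   # n1 initially holds blocks 0 and 1
    n1.local[b] = [0] * BLOCK_SIZE
    directory[b] = n1

n2.write(5, 42)    # address 5 is in block 1: the block migrates to n2
n2.write(6, 7)     # same block, now local to n2: no fault
value = n1.read(5) # block 1 migrates back to n1
print(value, n1.faults, n2.faults)
```

The second write by n2 causes no fault because the block is already local, which is the locality effect the text describes: the more accesses fall in blocks already held, the less network traffic is generated.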


2. Design and Implementation Issues of DSM 


1. Granularity. 

o Granularity refers to the block size of a DSM system, that is, to the unit of sharing and the 
unit of data transfer across the network when a network block fault occurs. 

o Possible units are a few words, a page, or a few pages. 

o Selecting proper block size is an important part of the design of a DSM system because 
block size is usually a measure of the granularity of parallelism explored and the amount of 
network traffic generated by network block faults. 

2. Structure of shared-memory space. 

o Structure refers to the layout of the shared data in memory. 

o The structure of the shared-memory space of a DSM system is normally dependent on the 
type of applications that the DSM system is intended to support. 

3. Memory coherence and access synchronization. 

o In a DSM system that allows replication of shared data items, copies of shared data items 
may simultaneously be available in the main memories of a number of nodes. 

o The main problem is to solve the memory coherence problem that deals with the 
consistency of a piece of shared data lying in the main memories of two or more nodes. 

o Since different memory coherence protocols make different assumptions and trade-offs, 
the choice is usually dependent on the pattern of memory access. 

o In a DSM system, concurrent accesses to shared data may be generated. 

o A memory coherence protocol alone is not sufficient to maintain the consistency of shared 
data. 

o Synchronization primitives, such as semaphores, event counts, and locks, are needed to 
synchronize concurrent accesses to shared data. 

4. Data location and access. 

o To share data in a DSM system, it should be possible to locate and retrieve the data 
accessed by a user process. 
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o Therefore, a DSM system must implement some form of data block locating mechanism in 
order to service network data block faults to meet the requirement of the memory 
coherence semantics being used. 

5. Replacement strategy. 

o If the local memory of a node is full, a cache miss at that node implies not only a fetch of 
the accessed data block from a remote node but also a replacement. 

o A data block of the local memory must be replaced by the new data block. 

o Thus, a cache replacement strategy is also necessary in the design of a DSM system. 

6. Thrashing 

o In a DSM system, data blocks migrate between nodes on demand. 

o Therefore, if two nodes compete for write access to a single data item, the corresponding 
data block may be transferred back and forth at such a high rate that no real work can get 
done. 

o A DSM system must use a policy to avoid this situation, usually known as thrashing. 


3. Granularity 
e One of the most visible parameters to be chosen in the design of a DSM system is the block 
size. 
e Several criteria for choosing this granularity parameter are described below. 


3.1 Factors Influencing Block Size Selection 
e In a typical loosely coupled multiprocessor system, sending large packets of data is not much 
more expensive than sending small ones. 
e This is usually due to the typical software protocols and overhead of the virtual memory layer 
of the operating system. 
e This fact favours relatively large block sizes. 
e However, other factors that influence the choice of block size are described below: 
1. Paging overhead: 
o Because shared-memory programs provide locality of reference, a process is likely to access 
a large region of its shared address space in a small amount of time. 
o Therefore, paging overhead is less for large block sizes as compared to the paging overhead 
for small block sizes. 
2. Directory size: 
o Another factor affecting the choice of block size is the need to keep directory information 
about the blocks in the system. 
o The larger the block size, the smaller the directory. 
o This ultimately results in reduced directory management overhead for larger block sizes. 
3. Thrashing: 
o The problem of thrashing may occur when data items in the same data block are being 
updated by multiple nodes at the same time, causing large numbers of data block transfers 
among the nodes without much progress in the execution of the application. 
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o While a thrashing problem may occur with any block size, it is more likely with larger block 
sizes, as different regions in the same block may be updated by processes on different 
nodes, causing data block transfers that are not necessary with smaller block sizes. 

4. False sharing 

o False sharing occurs when two different processes access two unrelated variables that reside 
in the same data block (Figure 6.2). 


[Figure: a single data block in which process P1 accesses data in one area and process P2 accesses data in a different area.] 
Figure 6.2: False Sharing 
o In such a situation, even though the original variables are not shared, the data block appears 
to be shared by the two processes. 
o The larger the block size, the higher the probability of false sharing, due to the fact that 
the same data block may contain different data structures that are used independently. 
o Notice that false sharing of a block may lead to a thrashing problem. 
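
The dependence of false sharing on block size can be checked with simple arithmetic. The following Python sketch is only an illustration: the function names and the two byte addresses (1000 and 3000) are made up for the example.

```python
def block_of(address, block_size):
    """Number of the block that contains the given byte address."""
    return address // block_size


def falsely_shared(addr_x, addr_y, block_size):
    """True if two distinct, unrelated variables fall in the same block."""
    return addr_x != addr_y and block_of(addr_x, block_size) == block_of(addr_y, block_size)


x_addr, y_addr = 1000, 3000      # hypothetical addresses of two unrelated variables
results = {size: falsely_shared(x_addr, y_addr, size) for size in (256, 1024, 4096, 8192)}
for size, shared in results.items():
    print(size, shared)
```

With 256-byte or 1024-byte blocks the two variables lie in different blocks, but with 4096-byte or 8192-byte blocks they share one, so every write to either variable would drag the whole block between the two nodes.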


3.2 Using Page Size as Block Size 

e The relative advantages and disadvantages of small and large block sizes make it difficult for a 
DSM designer to decide on a proper size. 

e Therefore, a suitable compromise in granularity, adopted by several existing DSM systems, is 
to use the typical page size of a conventional virtual memory implementation as the block size 
of a DSM system. 

e Using page size as the block size of a DSM system has the following advantages: 

o It allows the use of existing page-fault schemes to trigger a DSM page fault. 

o Thus memory coherence problems can be resolved in page-fault handlers. 

o It allows the access right control to be readily integrated into the functionality of the 
memory management unit of the system. 

o As long as a page can fit into a packet, page sizes do not impose undue communication 
overhead at the time of network page fault. 

o Experience has shown that a page size is a suitable data entity unit with respect to memory 
contention. 
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3.3 Structure of shared memory 
e Structure defines the abstract view of the shared-memory space to be presented to the 
application programmers of a DSM system. 
e The three commonly used approaches for structuring the shared-memory space of a DSM system 
are as follows: 
1. No structuring. 

o Most DSM systems do not structure their shared-memory space. 

o In these systems, the shared-memory space is simply a linear array of words. 

o An advantage of the use of unstructured shared-memory space is that it is convenient to 
choose any suitable page size as the unit of sharing and a fixed grain size may be used for 
all applications. 

o Therefore, it is simple and easy to design such a DSM system. 

o It also allows applications to impose whatever data structures they want on the shared 
memory. 


2. Structuring by data type. 

o In this method, the shared-memory space is structured either as a collection of objects (as 
in Clouds) or as a collection of variables in the source language (as in Munin). 

o The granularity in such DSM systems is an object or a variable. 

o But since the sizes of the objects and data types vary greatly, these DSM systems use 
variable grain size to match the size of the object/variable being accessed by the 
application. 

o The use of variable grain size complicates the design and implementation of these DSM 
systems. 


3. Structuring as a database. 

o Another method is to structure the shared memory like a database. 

o For example, Linda takes this approach. 

o Its shared-memory space is ordered as an associative memory (that is, a memory addressed by 
content rather than by name or address) called a tuple space, which is a collection of 
immutable tuples with typed data items in their fields. 

o A set of primitives that can be added to any base language are provided to place tuples in 
the tuple space and to read or extract them from the tuple space. 

o To perform updates, old data items in the DSM are replaced by new data items. 

o Processes select tuples by specifying the number of their fields and their values or types. 

o Although this structure allows the location of data to be separated from its value, it requires 
programmers to use special access functions to interact with the shared-memory space. 
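
A tuple space of this kind can be sketched in Python as follows. This is a minimal single-process model, not Linda itself: the TupleSpace class is hypothetical, the method in_ stands for Linda's "in" primitive (renamed because "in" is a Python keyword), and unlike real Linda the read and extract operations here return None instead of blocking when no tuple matches.

```python
class TupleSpace:
    def __init__(self):
        self._tuples = []

    def out(self, *fields):
        """Place a tuple in the tuple space."""
        self._tuples.append(tuple(fields))

    def _match(self, tup, pattern):
        # A pattern field is either a type (matches any value of that type)
        # or a concrete value (must be equal). Field counts must agree.
        return len(tup) == len(pattern) and all(
            isinstance(f, p) if isinstance(p, type) else f == p
            for f, p in zip(tup, pattern)
        )

    def rd(self, *pattern):
        """Return a matching tuple without removing it (None if absent)."""
        return next((t for t in self._tuples if self._match(t, pattern)), None)

    def in_(self, *pattern):
        """Remove and return a matching tuple (None if absent)."""
        found = self.rd(*pattern)
        if found is not None:
            self._tuples.remove(found)
        return found


space = TupleSpace()
space.out("count", 1)
old = space.in_("count", int)       # extract the old tuple ...
space.out("count", old[1] + 1)      # ... and place a replacement tuple
current = space.rd("count", int)
print(current)
```

Because tuples are immutable, the update is expressed as extracting the old tuple and inserting a new one, and the tuple is selected by its content ("count" plus any integer) rather than by an address.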


5. Consistency Models 
e A consistency model basically refers to the degree of consistency that has to be maintained for the 
shared-memory data for the memory to work correctly for a certain set of applications. 
e It is defined as a set of rules that applications must obey if they want the DSM system to provide 
the degree of consistency guaranteed by the consistency model. 
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5.1 Strict Consistency Model 


The strict consistency model is the strongest form of memory coherence, having the most 
stringent consistency requirement. 

A shared-memory system is said to support the strict consistency model if the value returned 
by a read operation on a memory address is always the same as the value written by the most 
recent write operation to that address, irrespective of the locations of the processes 
performing the read and write operations. 

All writes instantaneously become visible to all processes. 

Implementation of the strict consistency model requires the existence of an absolute global 
time so that memory read / write operations can be correctly ordered to make the meaning of 
“most recent” clear. 

As the existence of an absolute global time in a distributed system is not possible, 
implementation of the strict consistency model for a DSM system is practically not possible. 


5.2 Sequential Consistency Model 


The sequential consistency model was proposed by Lamport. 

A shared-memory system is said to support the sequential consistency model if all processes 
see the same order of all memory access operations on the shared memory. 

The exact order in which the memory access operations are interleaved does not matter. 

If the three operations read (r1), write (w1) , read(r2) are performed on a memory address in 
that order , any of the orderings (r1,w1,r2) , (r1,r2,w1), (w1,r1,r2), (w1,r2,r1), (r2,r1,w1), 
(r2,w1,r1) of the three operations is acceptable provided all processes see the same ordering. 
If one process sees one of the orderings of the three operations and another process sees a 
different one, the memory is not a sequentially consistent memory. 

Note here that the only acceptable ordering for a strictly consistent memory is (r1, w1, r2). 
The consistency requirement of the sequential consistency model is weaker than that of the 
strict consistency model. 

A DSM system supporting the sequential consistency model can be implemented by ensuring 
that no memory operation is started until all the previous ones have been completed. 
A sequentially consistent memory provides one-copy / single-copy semantics because all the 
processes sharing a memory location always see exactly the same contents stored in it. 
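
The r1, w1, r2 example can be checked mechanically. The Python sketch below follows the simplified definition used in this section (any interleaving is acceptable as long as every process sees the same one); the function name and the representation of a process's view as a tuple of operation names are assumptions made for the example.

```python
from itertools import permutations

ops = ("r1", "w1", "r2")
acceptable = list(permutations(ops))    # every possible interleaving of the three operations


def sequentially_consistent(views):
    """views: the order of the operations as seen by each process."""
    first = views[0]
    return first in acceptable and all(view == first for view in views)


same = sequentially_consistent([("w1", "r1", "r2"), ("w1", "r1", "r2")])
different = sequentially_consistent([("r1", "w1", "r2"), ("w1", "r1", "r2")])
print(len(acceptable), same, different)
```

There are six candidate orderings, and the memory is sequentially consistent only when all processes report the same one; a strictly consistent memory would accept just ("r1", "w1", "r2").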


5.3 Causal Consistency Model 


The causal consistency model relaxes the requirement of the sequential consistency model for 
better concurrency. 

Unlike the sequential consistency model, in the causal consistency model, all processes see only 
those memory reference operations in the same (correct) order that are potentially causally 
related. 

Memory reference operations that are not potentially causally related may be seen by different 
processes in different orders. 

A memory reference operation (Read / write) is said to be potentially causally related to 
another memory reference operation if the first one might have been influenced in any way by 
the second one. 
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A shared memory system is said to support the causal consistency model if all write operations 
that are potentially causally related are seen by all processes in the same (correct) order. 
Write operations that are not potentially causally related may be seen by different processes 
in different orders. 

Note that “correct order” means that if a write operation (w2) is causally related to another 
write operation (w1), the acceptable order is (w1, w2) because the value written by w2 might 
have been influenced in some way by the value written by w1. 

Therefore, (w2, w1) is not an acceptable order. 


5.4 Pipelined Random Access Memory Consistency Model 


The pipelined random-access memory (PRAM) consistency model, proposed by Lipton and 
Sandberg, provides weaker consistency semantics than the consistency models described so far. 

It only ensures that all write operations performed by a single process are seen by all other 
processes in the order in which they were performed as if all the write operations performed 
by a single process are in a pipeline. 

Write operations performed by different processes may be seen by different processes in 
different orders. 

For example, if w11 and w12 are two write operations performed by a process P1 in that order 
and w21 and w22 are two write operations performed by a process P2 in that order, a process 
P3 may see them in the order [(w11, w12), (w21, w22)] and another process P4 may see them 
in the order [(w21, w22), (w11, w12)]. 

It leads to better performance because a process need not wait for a write operation performed 
by it to complete before starting the next one since all write operations of a single process are 
pipelined. 
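
The PRAM requirement can be expressed as a small check: in every observer's view, the writes of each single process must appear in the order in which that process issued them. The Python sketch below is illustrative only; the function names and the list-based representation of views are assumptions.

```python
def preserves_order(view, issued):
    """True if the operations in `issued` appear in `view` in the same relative order."""
    positions = [view.index(op) for op in issued]
    return positions == sorted(positions)


def pram_consistent(views, writes_by_process):
    """Every view must preserve the issue order of every single process."""
    return all(
        preserves_order(view, issued)
        for view in views
        for issued in writes_by_process.values()
    )


writes = {"P1": ["w11", "w12"], "P2": ["w21", "w22"]}
p3_view = ["w11", "w12", "w21", "w22"]
p4_view = ["w21", "w22", "w11", "w12"]
bad_view = ["w12", "w11", "w21", "w22"]   # reorders P1's own writes

ok = pram_consistent([p3_view, p4_view], writes)
not_ok = pram_consistent([p3_view, bad_view], writes)
print(ok, not_ok)
```

P3 and P4 disagree about which process's writes came first, which PRAM consistency allows, while the last view is rejected because it shows w12 before w11.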


5.5 Processor Consistency Model 


The processor consistency model, proposed by Goodman, is very similar to the PRAM 
consistency model with an additional restriction of memory coherence. 

A processor consistent memory is both coherent and adheres to the PRAM consistency model. 
Memory coherence means that for any memory location all processes agree on the same order 
of all write operations to that location. 

Processor consistency ensures that all write operations performed on the same memory 
location (no matter by which process they are performed) are seen by all processes in the same 
order. 

Therefore, in the example given for PRAM consistency, if w12 and w22 are write operations for 
writing to the same memory location x, all processes must see them in the same order: w12 
before w22 or w22 before w12. 

For processor consistency, both processes P3 and P4 must see the write operations in the same 
order, which may be either [(w11, w12), (w21, w22)] or [(w21, w22), (w11, w12)]. 
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5.6 Weak Consistency Model 

e The weak consistency model, proposed by Dubois et al., is designed to take advantage of the 
following two characteristics common to many applications: 

1. It is not necessary to show the change in memory done by every write operation to other 
processes. 

o When a process executes in a critical section, other processes are not supposed to see 
the changes made by the process to the memory until the process exits from the critical 
section. 

o All changes made to the memory by the process while it is in its critical section need be 
made visible to other processes only at the time when the process exits from the 
critical section. 

2. Isolated accesses to shared variables are rare. 

e The problem in implementing this idea is determining how the system can know that it is time to 
show the changes performed by a process to other processes since this time is different for 
different applications. 

e A DSM system that supports the weak consistency model uses a special variable called a 
synchronization variable. 

e The operations on it are used to synchronize memory. 

e For supporting weak consistency, the following requirements must be met : 

o All accesses to synchronization variable must obey sequential consistency semantics. 

o All previous write operations must be completed everywhere before an access to a 
synchronization variable is allowed. 

co All previous accesses to synchronization variables must be completed before access to a 
nonsynchronization variable is allowed. 
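
The behaviour described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the class WeakNode, its methods, and the use of a plain dictionary as the "agreed" shared memory are hypothetical names chosen for illustration, and a single synchronize() call stands in for an access to the synchronization variable.

```python
class WeakNode:
    """One DSM node under a simplified weak consistency model."""

    def __init__(self, shared):
        self.shared = shared   # dict standing in for globally agreed memory
        self.local = {}        # this node's current view of shared memory
        self.pending = {}      # local writes not yet shown to other nodes

    def write(self, var, value):
        # Ordinary writes stay local until the next synchronization.
        self.local[var] = value
        self.pending[var] = value

    def read(self, var):
        return self.local.get(var)

    def synchronize(self):
        # Access to the synchronization variable: first complete all earlier
        # writes everywhere, then bring in the changes made by other nodes.
        self.shared.update(self.pending)
        self.pending.clear()
        self.local = dict(self.shared)


shared = {}
p1, p2 = WeakNode(shared), WeakNode(shared)
p1.write("x", 1)
print(p2.read("x"))   # None: the write has not been synchronized yet
p1.synchronize()
p2.synchronize()
print(p2.read("x"))   # 1
```

Note that the write becomes visible to p2 only after both nodes have accessed the synchronization variable, which is the point of the model: individual writes need not be propagated one by one.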


5.7 Release Consistency Model 
• In the weak consistency model, when a synchronization variable is accessed, memory is
synchronized by performing the following two operations:
1. All changes made to the memory by the process are propagated to other nodes.
2. All changes made to the memory by other processes are propagated from other nodes to
the process’s node.

• Since a single synchronization variable is used in the weak consistency model, the system
cannot know whether a process accessing a synchronization variable is entering a critical
section or exiting from a critical section.

• For better performance, the release consistency model provides a mechanism to clearly tell the
system whether a process is entering or exiting a critical section, so that the system can
perform only the first or only the second operation when a synchronization variable is
accessed by a process.

• This is achieved by using two synchronization variables, acquire and release, instead of a
single synchronization variable.

• Acquire is used by a process to tell the system that it is about to enter a critical section, so that
the system performs only the second operation when this variable is accessed.

• Release is used by a process to tell the system that it has just exited a critical
section, so that the system performs only the first operation when this variable is accessed.

• Release consistency may also be realized by using the synchronization mechanism based on
barriers instead of critical sections.

• A barrier defines the end of a phase of execution of a group of concurrently executing
processes.

• All processes in the group must complete their execution up to a barrier before any process is
allowed to proceed with its execution following the barrier.

• When a process of a group encounters a barrier during its execution, it blocks until all other
processes in the group complete their executions up to the barrier.

• When the last process completes its execution up to the barrier, all shared variables are
synchronized and then all processes resume their executions.

• A barrier can be implemented by using a centralized barrier server.
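
The difference between acquire and release can be sketched as follows. This is a minimal illustration, not a real DSM implementation: ReleaseNode and its method names are hypothetical, and a shared dictionary stands in for the memory of the other nodes.

```python
class ReleaseNode:
    """One DSM node under a simplified release consistency model."""

    def __init__(self, shared):
        self.shared = shared   # dict standing in for the other nodes' memory
        self.local = {}        # this node's view of shared memory
        self.pending = {}      # writes made inside the critical section

    def acquire(self):
        # Entering a critical section: only the second operation is needed,
        # i.e. changes made by other processes are brought to this node.
        self.local.update(self.shared)

    def write(self, var, value):
        self.local[var] = value
        self.pending[var] = value

    def read(self, var):
        return self.local.get(var)

    def release(self):
        # Leaving a critical section: only the first operation is needed,
        # i.e. this process's changes are propagated to other nodes.
        self.shared.update(self.pending)
        self.pending.clear()
```

A process would call acquire(), perform its reads and writes, and then call release(); another node sees the new values only after its own next acquire().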


6. Implementing Sequential Consistency Model 

• Protocols for implementing the sequential consistency model in a DSM system depend to a
great extent on whether the DSM system allows replication and migration of shared memory
data blocks.

• Following are the various migration and replication strategies:
1. Nonreplicated, nonmigrating blocks (NRNMBs)
2. Nonreplicated, migrating blocks (NRMBs)
3. Replicated, migrating blocks (RMBs)
4. Replicated, nonmigrating blocks (RNMBs)


6.1 Nonreplicated, Nonmigrating Blocks

• In this strategy, each block of the shared memory has a single copy whose location is always
fixed.

• All access requests to a block from any node are sent to the owner node of the block, which
has the only copy of the block.

• On receiving a request from a client node, the memory management unit (MMU) and operating
system software of the owner node perform the access request on the block and return a
response to the client.


Figure 6.3: Nonreplicated, nonmigrating blocks (NRNMB) strategy
(Client node: sends request and receives response. Owner node of block: receives request,
performs data access, and sends response.)
Data Locating in the NRNMB strategy:
o The NRNMB strategy has the following characteristics:
▪ There is a single copy of each block in the entire system.
▪ The location of a block never changes.

o The best approach for locating a block in this case is to use a simple mapping function to
map a block to a node.

o When a fault occurs, the fault handler of the faulting node uses the mapping function to
get the location of the accessed block and forwards the access request to that node.
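
As a small illustration of such a mapping function, the sketch below assigns blocks to nodes by taking the block number modulo the number of nodes. The modulo rule, the function name owner_node, and the helper access are assumptions made for this example; the notes do not prescribe a particular mapping.

```python
def owner_node(block_id: int, num_nodes: int) -> int:
    """Fixed mapping from a block number to the node that holds it."""
    return block_id % num_nodes


def access(node_memories, block_id, op, value=None):
    """Forward a read or write to the owner node of the block."""
    owner = owner_node(block_id, len(node_memories))
    if op == "write":
        node_memories[owner][block_id] = value
        return None
    return node_memories[owner].get(block_id)


# Four nodes, each with its own private block store.
node_memories = [{} for _ in range(4)]
access(node_memories, 10, "write", "hello")
print(owner_node(10, 4))                 # 2
print(access(node_memories, 10, "read"))  # hello
```

Because the mapping never changes, every node can compute the location of a block locally without consulting any table or server.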


6.2 Nonreplicated, Migrating Blocks 
• In this strategy each block of the shared memory has a single copy in the entire system.
• Each access to a block causes the block to migrate from its current node to the node from
where it is accessed.
• The owner node of a block changes as soon as the block is migrated to a new node.
• When a block is migrated away, it is removed from any local address space it has been mapped
into.

Figure 6.4: Nonreplicated, migrating blocks (NRMB) strategy
(Client node: sends a block request and becomes the new owner node of the block after its
migration. Owner node: owns the block before its migration.)
Data Locating in the NRMB strategy: 
1) Broadcasting
Figure 6.5: Structure and location of owned blocks table in the broadcasting data locating mechanism of
NRMB strategy.


▪ Each node maintains an owned blocks table that contains an entry for each block for
which the node is the current owner, as shown in Figure 6.5.

▪ When a fault occurs, the fault handler of the faulting node broadcasts a read/write
request on the network.

▪ The node currently having the requested block then responds to the broadcast request
by sending the block to the requesting node.

▪ A major disadvantage of the broadcasting algorithm is that it does not scale well.

2) Centralized Algorithm

▪ A centralized server maintains a block table that contains the location information for
all blocks in the shared memory space.

▪ When a fault occurs, the fault handler of the faulting node (N) sends a request for the
accessed block to the centralized server.

▪ The centralized server extracts the location information of the requested block from the
block table, forwards the request to that node, and changes the location information in
the corresponding entry of the block table to node N.


Figure 6.6: Structure and location of owned blocks table in the Centralized server data locating
mechanism of NRMB strategy.
(Nodes 1 to M are separated by node boundaries. The centralized server holds the block table;
each entry has a block address, which remains fixed, and an owner node, which changes
dynamically.)


▪ On receiving the request, the current owner transfers the block to node N, which
becomes the new owner of the block.

▪ A major drawback is that the centralized server serializes location queries, reducing parallelism.

▪ The failure of the centralized server will cause the DSM system to stop functioning.
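
The centralized algorithm can be sketched as below. The names CentralServer, handle_request, and fault are hypothetical, and message passing between nodes is replaced by direct dictionary operations, so this shows only the bookkeeping, not the communication.

```python
class CentralServer:
    """Keeps the block table: block address -> current owner node."""

    def __init__(self, initial_owners):
        self.block_table = dict(initial_owners)

    def handle_request(self, block, requester):
        # Look up the current owner, then record the requester as new owner.
        current_owner = self.block_table[block]
        self.block_table[block] = requester
        return current_owner


def fault(server, memories, block, requester):
    """Fault handler of node `requester`: ask the server, then migrate."""
    owner = server.handle_request(block, requester)
    if owner != requester:
        # The current owner transfers the block to the requesting node.
        memories[requester][block] = memories[owner].pop(block)
    return owner
```

Every fault passes through handle_request, which is why the server both serializes location queries and forms a single point of failure.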

3) Fixed distributed server algorithm

▪ This scheme is a direct extension of the centralized server scheme.

▪ There is a block manager on several nodes, and each block manager is given a
predetermined subset of data blocks to manage.

▪ The mapping from data blocks to block managers and their corresponding nodes is
described by a mapping function.

▪ Whenever a fault occurs, the mapping function is used by the fault handler of the faulting
node to find out the node whose block manager is managing the currently
accessed block.

Figure 6.7: Structure and location of owned blocks table in the Fixed distributed data locating mechanism
of NRMB strategy.


▪ Then a request for the block is sent to the block manager of that
node.

▪ The block manager handles the request exactly in the same manner as that described
for the centralized server algorithm.

4) Dynamic distributed server algorithm

▪ Each node has a block table that contains the ownership information for all blocks.

▪ A field gives the node a hint on the location of the owner of a block and hence is called
the probable owner.

Figure 6.8: Structure and location of owned blocks table in the Dynamic distributed data locating
mechanism of NRMB strategy.


▪ When a fault occurs, the faulting node (N) extracts from its block table the node information
stored in the probable owner field of the entry for the accessed block.

▪ It then sends a request for the block to that node.

▪ If that node is the true owner of the block, it transfers the block to node N and updates
the location information of the block in its local block table to node N.

▪ Otherwise, it looks up its local block table, forwards the request to the node indicated in the
probable owner field of the entry for the block, and updates the value of this field to
node N.
▪ When node N receives the block, it becomes the new owner of the block.
Advantages:
▪ No communication costs are incurred when a process accesses data currently held
locally.
▪ It allows the applications to take advantage of data access locality.
Drawbacks:
▪ It is prone to the thrashing problem.
▪ This method also cannot take advantage of parallelism.
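
The probable-owner forwarding used by the dynamic distributed server algorithm can be sketched as follows. The data layout (a dictionary per node with "probable" and "owned" tables) and the function name request_block are assumptions for illustration, and the sketch assumes that the chain of hints always leads to the true owner.

```python
def request_block(nodes, block, requester):
    """Follow probable-owner hints until the true owner is reached.

    nodes maps node id -> {"probable": {block: node id},
                           "owned": {block: data}}.
    Returns the number of request messages that were needed.
    """
    current = nodes[requester]["probable"][block]
    hops = 1
    while block not in nodes[current]["owned"]:
        # Not the true owner: forward the request along the hint and
        # remember the requester as the new probable owner.
        next_node = nodes[current]["probable"][block]
        nodes[current]["probable"][block] = requester
        current = next_node
        hops += 1
    # True owner: transfer the block and point the local entry at N.
    data = nodes[current]["owned"].pop(block)
    nodes[current]["probable"][block] = requester
    nodes[requester]["owned"][block] = data
    nodes[requester]["probable"][block] = requester
    return hops
```

Each node that forwards a request updates its hint, so later requests from other nodes tend to reach the owner in fewer hops.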


6.3 Replicated, Migrating Blocks 
• With replicated blocks, read operations can be carried out in parallel at multiple nodes by
accessing the local copy of the data.
• The average cost of read operations is reduced because no communication overhead is
involved if a replica of the data exists at the local node.
• Replication increases the cost of write operations because, to write a block, all its replicas must be
invalidated or updated to maintain consistency.


Data Locating in the RMB strategy: 
1. Broadcasting

▪ Each node has an owned blocks table, which has an entry for each block for which the
node is the owner.

▪ Each entry of this table has a copy set field that contains a list of nodes that currently
have a valid copy of the corresponding block.

▪ When a read fault occurs, the faulting node (N) sends a broadcast read request on the
network to find the owner of the required block.

Figure 6.9: Structure and location of owned blocks table in the broadcasting data locating mechanism of
RMB strategy.


▪ The owner of the block responds by adding node N to the block’s copy set field in its
owned blocks table and sending a copy of the block to node N.

▪ When a write fault occurs, the faulting node (N) sends a broadcast write request on the
network for the required block.

▪ On receiving this request, the owner of the block relinquishes its ownership to node N
and sends the block and its copy set to node N.

▪ When node N receives the block and the copy set, it sends an invalidation message to
all nodes in the copy set.

▪ Node N now becomes the new owner of the block, and therefore an entry is made for
the block in its local owned blocks table.

▪ The copy set field of the entry is initialized to indicate that there are no other copies of
the block, since all the copies were invalidated.


2. Centralized server algorithm 


▪ In this strategy each entry of the block table, managed by the centralized server, has an
owner-node field that indicates the current owner node of the block and a copy set field
that contains a list of nodes having a valid copy of the block.

▪ When a read/write fault occurs, the faulting node (N) sends a read/write fault request
for the accessed block to the centralized server.

▪ For a read fault, the centralized server adds node N to the block’s copy set and returns
the owner node to node N.

▪ For a write fault, it returns both copy set and owner node information to node N and
initializes the copy set field to contain only node N.

▪ Node N sends a request for the block to the owner node.

▪ On receiving the request, the owner node returns a copy of the block to node N.
▪ In a write fault, node N also sends an invalidate message to all nodes in the copy set.
▪ Node N can then perform the read/write operation.


Figure 6.10: Structure and location of owned blocks table in the Centralized server data locating mechanism of
RMB strategy.
(Nodes 1 to M are separated by node boundaries. The block table contains an entry for each
block in the shared-memory space.)
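
The bookkeeping done by the centralized server for read and write faults in the RMB strategy can be sketched as below. RmbServer and its method names are hypothetical. Two details are assumptions not stated in the text above: the owner is kept in the copy set, and the owner-node field is updated to node N on a write fault.

```python
class RmbServer:
    """Block table with an owner-node field and a copy set per block."""

    def __init__(self):
        self.table = {}

    def register(self, block, owner):
        # Assumption: the owner itself counts as holding a valid copy.
        self.table[block] = {"owner": owner, "copy_set": {owner}}

    def read_fault(self, block, n):
        entry = self.table[block]
        entry["copy_set"].add(n)       # node N will hold a valid copy
        return entry["owner"]          # node N fetches the block from here

    def write_fault(self, block, n):
        entry = self.table[block]
        owner, old_copies = entry["owner"], set(entry["copy_set"])
        entry["owner"] = n             # assumption: N becomes the owner
        entry["copy_set"] = {n}        # only node N holds a valid copy now
        return owner, old_copies


def nodes_to_invalidate(old_copies, n):
    """Nodes to which node N must send an invalidate message."""
    return sorted(old_copies - {n})
```

After a write fault, node N requests the block from the returned owner and sends invalidations to every node returned by nodes_to_invalidate before performing the write.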


3. Fixed Distributed Server algorithm 


▪ In this scheme there is a block manager on several nodes, as shown in Figure 6.11.

▪ Each block manager manages a predetermined subset of blocks, and a mapping
function is used to map a block to a particular block manager and its corresponding
node.

▪ When a fault occurs, the mapping function is used to find the location of the block
manager that is managing the currently requested block.

Figure 6.11: Structure and location of owned blocks table in the fixed distributed server data locating
mechanism of RMB strategy.
▪ A request for the accessed block is sent to the block manager of that node.
▪ The request is handled exactly in the same manner as that described for the centralized server
algorithm.
4. Dynamic Distributed Server algorithm
▪ Each node has a block table that contains an entry for all blocks in the shared memory
space.

Figure 6.12: Structure and location of owned blocks table in the Dynamic distributed server data locating
mechanism of RMB strategy.
▪ Each entry of the table has a probable owner field that gives the node a hint on the location
of the owner of the corresponding block.

▪ Also, a copy set field is present that lists nodes having a valid copy of the block.

▪ When a fault occurs, the fault handler of the faulting node (N) extracts the probable owner
information from its local block table and sends a request for the block to that node.

▪ If that node is not the true owner of the block, it looks up its local block table and
forwards the request to the node indicated in the probable owner field of the entry for
the block.

▪ If the node is the true owner of the block, it proceeds with the request.

▪ For a read fault, it adds node N to the copy set field of the entry corresponding to the
block and sends a copy of the block to node N, which performs the read operation.

▪ For a write fault, it sends a copy of the block and its copy set information to node N and
deletes the copy set information of that block from the local block table.

▪ On receiving the block and copy set information, node N sends an invalidation request
to all nodes in the copy set.


6.4 Replicated, Nonmigrating Blocks 
• In this strategy, a shared-memory block may be replicated at multiple nodes of the system, but
the location of each replica is fixed.
• A read or write access to a memory address is carried out by sending the access request to one
of the nodes having a replica of the block containing the memory address.
• All replicas of a block are kept consistent by updating them all in case of a write access.


Data Locating in RNMB Strategy: 
o The best approach to data locating for handling read/write operations is to have a block
table at each node and a sequence table with the sequencer.
o The block table of a node has an entry for each block in the shared memory.
o Each entry maps a block to one of its replica locations.

Figure 6.13: Structure and location of block table and sequence table in the centralized sequencer data
locating mechanism for RNMB strategy.


o The sequence table also has an entry for each block in the shared-memory space.

o Each entry of the sequence table has three fields: a field containing the block address, a
replica-set field containing a list of nodes having a replica of the block, and a sequence number
field that is incremented by 1 for every new modification performed on the block.

o For performing a read operation on a block, the replica location of the block is extracted
from the local block table and the read request is directly sent to that node.

o A write operation on a block is sent to the sequencer.

o The sequencer assigns the next sequence number to the requested modification.

o It then multicasts the modification with this sequence number to all the nodes listed in the
replica-set field of the entry for the block.

o The write operations are performed at each node in sequence number order.
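
The role of the sequencer can be sketched as follows. The class names Sequencer and Replica are hypothetical, and the multicast is modelled simply as a list of (node, sequence number, value) messages; a replica holds back any modification that arrives ahead of its turn.

```python
class Sequencer:
    """Sequence table: one replica set and sequence number per block."""

    def __init__(self, replica_sets):
        self.table = {block: {"replicas": list(nodes), "seq": 0}
                      for block, nodes in replica_sets.items()}

    def write(self, block, value):
        entry = self.table[block]
        entry["seq"] += 1              # next sequence number for this block
        # One message per replica; a real system would multicast these.
        return [(node, entry["seq"], value) for node in entry["replicas"]]


class Replica:
    """A replica of one block that applies writes in sequence order."""

    def __init__(self):
        self.value = None
        self.applied = 0               # last sequence number applied
        self.held = {}                 # modifications that arrived early

    def deliver(self, seq, value):
        self.held[seq] = value
        while self.applied + 1 in self.held:
            self.applied += 1
            self.value = self.held.pop(self.applied)
```

Because every replica applies modifications strictly in sequence number order, all replicas of a block end up with the same final value even if messages arrive in different orders.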


7. Replacement Strategy 
7.1 Which Block to Replace

Usage based versus non-usage based.
▪ Usage based algorithms keep track of the history of usage of a cache line (or page) and
use this information to make replacement decisions.
▪ Least recently used (LRU) is an example of this type of algorithm.
▪ Non-usage based algorithms do not take the record of use of cache lines into account
when doing replacement.
▪ First in, first out (FIFO) and Rand (random or pseudorandom) belong to this class.
Fixed space versus variable space.
▪ Fixed space algorithms assume that the cache size is fixed, while variable space
algorithms are based on the assumption that the cache size can be changed dynamically
depending on the need.
▪ Replacement in fixed space algorithms simply involves the selection of a
specific cache line.
▪ In a variable space algorithm, a fetch does not imply a replacement, and a swap out can
take place without a corresponding fetch.
▪ Variable space algorithms are not suitable for a DSM system because each node’s
memory that acts as cache for the virtually shared memory is fixed in size.
▪ In the DSM system of IVY, each memory block of a node is classified into one of the
following five types.
(a) Unused:
• A free memory block that is not currently being used.
• It has the highest replacement priority.
(b) Nil:
• A block that has been invalidated.
• Same priority as unused, i.e., the highest replacement priority.
(c) Read-only:
• A block for which the node has only read access right.
• The read-only blocks have the next replacement priority.
• This is because a copy of a read-only block is available with its owner, and
therefore it is possible to simply discard that block.
(d) Read owned:

• A block for which the node has only read access right but is also the owner of
the block.

• Read owned and writable blocks for which replica(s) exist on some other node(s)
have the next replacement priority because it is sufficient to pass ownership to
one of the replica nodes.

(e) Writable:

• A block for which the node has write access permission.

• Read owned and writable blocks for which only this node has a copy have the
lowest replacement priority because replacement of such a block involves
transfer of the block’s ownership as well as the block from the current node to
some other node.

• An LRU policy is used to select a block for replacement when all the blocks in the
local cache have the same priority.
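
The priority ordering above can be expressed as a small ranking function. This is only a sketch: the dictionary keys ("type", "replicas_elsewhere", "last_used") and the function names are hypothetical, and LRU is applied here as a tie-breaker among blocks of equal priority.

```python
def replacement_rank(block):
    """Lower rank means the block is replaced earlier."""
    kind = block["type"]
    if kind in ("unused", "nil"):
        return 0                      # highest replacement priority
    if kind == "read-only":
        return 1                      # the owner still has a copy
    # Read-owned or writable: cheaper if a replica exists elsewhere,
    # because only ownership has to be passed on.
    return 2 if block["replicas_elsewhere"] else 3


def choose_victim(blocks):
    """Pick the block to replace: best rank first, then least recently used."""
    return min(blocks, key=lambda b: (replacement_rank(b), b["last_used"]))
```

A smaller "last_used" value is taken to mean that the block was used longer ago, so among blocks of the same rank the least recently used one is chosen.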


7.2 Where to place a Replaced Block. 
• The two commonly used approaches for storing a useful block at the time of its replacement
are as follows:
1. Using secondary storage:
▪ The block is simply transferred on to a local disk.
▪ The advantage of this method is that it does not waste any memory space and, if the
node wants to access the same block again, it can get the block locally without a need
for network access.
2. Using memory space of other nodes:
▪ The other method for storing a useful block is to keep track of free memory space at all
nodes in the system and to simply transfer the replaced block to the memory of a node
with available space.
▪ This method requires each node to maintain a table of free memory space in all other
nodes.
▪ This table may be updated by having each node piggyback its memory status
information during normal traffic.


8. Thrashing 


• Thrashing occurs when the system spends a large amount of time transferring shared data
blocks from one node to another.

• It is a serious performance problem with DSM systems that allow data blocks to migrate from
one node to another.

• Thrashing may occur in the following situations:

(a) When interleaved data accesses made by processes on two or more nodes cause a data
block to move back and forth from one node to another in quick succession (a ping-pong
effect).

(b) When blocks with read-only permissions are repeatedly invalidated soon after they are
replicated.

• Such situations indicate poor (node) locality in references. If not properly handled, thrashing
degrades system performance considerably.

• The following methods may be used to solve the thrashing problem in DSM systems.

(1) Providing application-controlled locks.

o Locking data to prevent other nodes from accessing it for a short period of time can
reduce thrashing.

o An application-controlled lock can be associated with each data block to implement this
method.


(2) Nailing a block to a node for a minimum amount of time.

o Disallow a block to be taken away from a node until a minimum amount of time t elapses
after its allocation to that node.

o The time t can either be fixed statically or be tuned dynamically on the basis of access
patterns.

o The main drawback of this scheme is that it is very difficult to choose the appropriate value
for the time.

o If the value is fixed statically, it is liable to be inappropriate in many cases.

o Therefore, tuning the value of t dynamically is the preferred approach in this case; the value
of t for a block can be decided based on past access patterns of the block.
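
The nailing rule can be sketched as a simple check made before a block is allowed to migrate. NailedBlock and try_migrate are hypothetical names, time is passed in explicitly as a number so the example stays deterministic, and only a statically fixed t is shown; dynamic tuning of t is not modelled.

```python
class NailedBlock:
    """A block that may not leave its node until t time units have passed."""

    def __init__(self, owner, allocated_at, min_hold):
        self.owner = owner
        self.allocated_at = allocated_at   # time the block arrived at owner
        self.min_hold = min_hold           # the minimum holding time t

    def try_migrate(self, new_owner, now):
        # Refuse the migration if the block has not yet stayed for time t.
        if now - self.allocated_at < self.min_hold:
            return False
        self.owner = new_owner
        self.allocated_at = now
        return True
```

With such a check, two nodes that access the same block alternately can no longer pull it back and forth on every access; each gets the block for at least t time units.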


(3) Tailoring the coherence algorithm to the shared data usage pattern.

o Thrashing can also be minimized by using different coherence protocols for shared data
having different characteristics.

o For example, the coherence protocol used in Munin for write shared variables avoids the
false sharing problem, which ultimately results in the avoidance of thrashing.


8.1 Advantages of DSM 
• Distributed Shared Memory is a high level mechanism for interprocess communication in
loosely coupled distributed systems.
1. Simpler Abstraction

o Programming loosely coupled distributed memory machines using message passing models
is tedious and error prone.

o RPC was introduced to provide a procedure call interface.

o Even in RPC, since the procedure call is performed in an address space different from that
of the caller’s address space, it is difficult for the caller to pass context related data or
complex data structures; that is, parameters must be passed by value.

o The shared memory programming paradigm shields the application programmers from
many such low level concerns.

o Thus the primary advantage of DSM is the simpler abstraction it provides to the application
programmers of loosely coupled distributed memory machines.


2. Better Portability of Distributed Application Programs

o Distributed application programs written for a shared memory multiprocessor system can
be executed on a distributed shared memory system without change.

o Therefore, it is easier to port an existing distributed application program to a distributed
memory system with DSM facility than to a distributed memory system without this facility.

3. Better Performance of Some Applications
(a) Locality of Data:
▪ The computation model of DSM is to make the data more accessible by moving it
around.
▪ DSM algorithms normally move data between nodes in large blocks.
▪ Thus in those applications that exhibit a reasonable degree of locality in their data
accesses, communication overhead is amortized over multiple memory accesses.
▪ This ultimately results in reduced overall communication cost for such applications.
(b) On Demand Data Movement
▪ The computation model of DSM also facilitates on demand movement of data as they
are being accessed.
▪ There are several distributed applications that execute in phases, where each
computation phase is preceded by a data exchange phase.
▪ The time needed for the data exchange phase is often dictated by the throughput of
existing communication bottlenecks.
▪ Therefore, in such applications, the on demand data movement facility provided by
DSM eliminates the data exchange phase, spreads the communication load over a longer
period of time, and allows for a greater degree of concurrency.
(c) Larger Memory Space
▪ With DSM facility, the total memory size is the sum of the memory sizes of all the nodes
in the system. Thus, paging and swapping activities, which involve disk access, are
greatly reduced.
4. Flexible Communication Environment
o The shared memory paradigm of DSM provides a more flexible communication
environment than the message passing approach: the sender process need not specify
the identity of the receiver processes of the data.
o It simply places the data in the shared memory and the receivers access it directly from the
shared memory.
o Therefore, the coexistence of the sender and receiver processes is also not necessary in the
shared memory paradigm.


Nirali C Dutiya, CE Department | 2160710 - Distributed Operating System
Darshan Institute of Engineering & Technology


Unit 7 - Naming in Distributed System


1. Overview 
• A distributed system comprises various objects like nodes, processes, files, I/O devices, etc.
• The naming system plays a vital part in the design of a distributed system.
• The naming system comprises two important mechanisms:
o The naming mechanism
o The locating mechanism
• The naming mechanism incorporated in the distributed OS enables character names to be
assigned to these objects so that they can be subsequently referred to.
• The locating mechanism is responsible for mapping the object’s name to the object’s location
in the distributed system.
• The most important function of the naming system is to provide location transparency in a
distributed system.
• It also facilitates transparent migration, replication of objects, and object sharing.
• An object can also have multiple names.
• The naming system also maps these multiple names to the same object.


2. Features 
1. Location transparency 
o This feature implies that the name of the object must not indicate the physical location of 
the object, directly or indirectly. 
o The name should not depend on physical connectivity, topology of the system, or even the 
current location of the object. 
2. Location Independency 
o Any distributed system allows object migration, movement, and relocation of objects 
dynamically among the distributed nodes. 
o Location independency means that the name of the object need not be changed when the 
object’s location is changed. 
o Also the user must be able to access the object irrespective of the node from where it is 
being accessed. 
3. Scalability 
o The naming system should be capable of adapting to the dynamically-changing scale, 
leading to dynamic changes in namespace. 
o Any changes in system scale should not require any change in naming or location 
mechanisms. 
4. Uniform Naming Convention 
o Most distributed systems use some standard naming conventions for naming different
types of objects.
o Files are named differently from users and processes.
o Ideally, a good naming system must use the same naming mechanism for all types of objects,
so as to reduce design complexity.
5. Meaningful names
o Meaningful names indicate what the object actually contains.
o A good naming system must support at least two levels of object identifiers: one which is
convenient for users and the other which is relevant for machines.
6. Allow multiple user-defined names for the same object
o In case of a sharable object, users must be allowed to use their own names for accessing
the object.
o A naming system must have the flexibility of assigning multiple user-defined names to the
same object.
o Also, it should be possible for the user to either change or delete the user-defined name for
the object without affecting the names given by other users to the same object.
7. Group Naming
o In almost all distributed systems, a broadcast facility is used to communicate with groups of
nodes, which may not comprise the entire distributed system.
o Hence a good naming system must allow many different objects to be identified by the
same name.
8. Performance 
o One important aspect of performance of a naming system is the time required to map an 
object’s name to its attributes. 
o This depends on the number of messages transmitted to and fro during the mapping 
operation. 
o For good performance, the naming system should use a small number of message 
exchanges. 
9. Fault tolerance 
o The naming system must be capable of tolerating faults such as the failure of a node or 
network congestion. 
o The naming system should remain operational in case of such failures, even if with
degraded performance.
10. Replication transparency 
o Replicas are generally created in a distributed system to improve system performance and 
reliability. 
o A good naming system should support multiple copies of the same object in a user- 
transparent manner 
o The user need not be aware that multiple copies exist in the system. 


11. Locating the nearest replica
Figure 7.1: Locating nearest replica
o When a naming system supports replicas, the object location mechanism should locate the
nearest replica of the requested object.
o As shown in Fig. 7.1, locating the replica at node N2 is better than locating those at N3 or N4,
which are placed farther away, incur a longer message transmission time, and are hence
undesirable.

12. Locating all replicas
o From a reliability and consistency point of view, the object-locating mechanism must be
able to locate all replicas, if need be.
o As shown in Fig. 7.1, consider a situation where N1 makes a request for an object and
the communication link to N2 fails.
o The replica is available at N4, which is farther away but can be used.

3. Basic concepts 
3.1 Name 


• A name is basically a set of symbols comprising characters, numerals, and punctuation symbols
taken from the ASCII character set.

• A name is an identifier, as it identifies an object.

• Examples of names are: TEST, S$asd123, node-1!, 234wer, etc.

• There are two basic types of names used in practice: human-oriented names and
system-oriented names.

1. Human Oriented Names

o A human-oriented name is a set of characters that is meaningful to users.

o Typical examples of human-oriented names are /user/123, /project/test-1, etc.

o Human-oriented names are defined by the users.

o Human-oriented names are independent of the physical location of the object.

o Human-oriented names are also called high-level names.

2. System Oriented Names 


Figure 7.2: Mapping names in namespace
(A first-level mapping and a second-level mapping lead to the physical addresses of the
named objects.)

o The system-oriented names are bit patterns of fixed size which can be automatically 
generated for each object when the object is created. 

o These names can be easily manipulated and stored by machines. 

o These names are also called unique identifiers or low-level names. 

o The naming model with a relation between human-oriented and system-oriented names is 
shown in Fig 7.2. 

o As shown, the human-oriented name is mapped to a system-oriented name, and at the next
level, it is further mapped to the corresponding replicas of the object.
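
The two-level mapping can be sketched with two lookup tables. All names, identifiers, and addresses below are made up for illustration; the notes do not specify the size or format of a system-oriented name.

```python
# First-level mapping: human-oriented name -> system-oriented name
# (a fixed-size bit pattern, shown here as an integer).
name_to_uid = {
    "/project/test-1": 0x2A17,
}

# Second-level mapping: system-oriented name -> physical addresses of
# the replicas of the named object (hypothetical node:location strings).
uid_to_replicas = {
    0x2A17: ["node-3:blk88", "node-7:blk12"],
}


def resolve(name):
    """Map a human-oriented name to its identifier and replica locations."""
    uid = name_to_uid[name]
    return uid, uid_to_replicas[uid]


print(resolve("/project/test-1"))
# (10775, ['node-3:blk88', 'node-7:blk12'])
```

Keeping the two levels separate means that an object can move or gain replicas by changing only the second table, while the human-oriented name stays the same.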

• Several names may be used for the same object and, conversely, the same name may
refer to different objects.
• The kinds of names that arise are abbreviation/alias, absolute/relative name, generic/multicast name,
descriptive/attribute-based name, and source routing name.
(a) Aliases/abbreviations 


Figure 7.3: Aliases or abbreviations (User 1 and User 2 each refer to an object by the alias myobject)


o In a partitioned namespace, a qualified name may be a combination of several names and
hence of long length.

o To avoid inconvenience to the user, the naming convention may allow users to access these
objects with user-defined short forms for qualified names called aliases or abbreviations.

o These names form the private context of the user and hence a separate mapping is 
maintained on this context basis. 

o The name resolution in case of using aliases works like this:
= The user specifies an alias or an abbreviation.
= It is converted to a qualified name by using the mapping associated with the private
context.
= A binding associates a simple name with a qualified name.

o A naming system may also allow more than one simple name to be bound to a single
qualified name within a given context; such a name is called a synonym or nickname.
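
A minimal sketch of alias resolution through a private context follows. The class name PrivateContext and the qualified names are hypothetical; the point is only that each user has a separate alias-to-qualified-name mapping, and that two aliases may be bound to the same qualified name.

```python
class PrivateContext:
    """Per-user mapping from simple names (aliases) to qualified names."""

    def __init__(self):
        self.bindings = {}

    def bind(self, alias, qualified_name):
        self.bindings[alias] = qualified_name

    def resolve(self, name):
        # Assumption: a name with no binding is already a qualified name.
        return self.bindings.get(name, name)


user1 = PrivateContext()
user1.bind("myobject", "/project/test-1/object-1")
user1.bind("obj", "/project/test-1/object-1")    # synonym for the same object

user2 = PrivateContext()
user2.bind("myobject", "/user/123/object-2")     # same alias, different object
```

Here the alias myobject means different objects for the two users, because each alias is interpreted in its owner's private context.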


(b) Absolute and relative names 
Figure 7.4: Absolute and relative names (hierarchical namespace starting at the root context)
o The object name can be of two types: absolute and relative.
= An absolute name is specified from the root to the specified object.
= A relative name is specified from the current context to the specified object.
o Fig 7.4 depicts a hierarchical namespace.
o The user’s context is context5 and the object to be located is object-1.

o Absolute name of the object is: Root context/context1/context3/context5/object-1

o Relative name of this same object in context 5 is: context5/object-1

(c) Generic and multicast names 

o Generic name:
= A name is mapped to any one of the set of objects to which it is bound.
= This facility is used in the scenario when a user wants a request to be serviced by any of
the servers capable of servicing the request.
= The user is not concerned about the server that services the request.

o Group or multicast naming:
= Here, a name is mapped to all the objects to which it is bound.
= This naming method is useful for broadcasting or multicasting.

(d) Descriptive/Attribute-based names 

o A naming system supports descriptive or attribute-based names and allows an object to be
named with a set of attributes.

o An attribute has a type (indicating format and meaning) and a value.

o Examples of attributes are: 
User: Sunita 
Creation date: 20/12/2008 
File type: Source 
Language: C 
Name: Test1 

o To identify an object, it is not necessary to specify all the attributes. 

o Multicast naming facility can be easily provided with this naming method by constructing 
an attribute for a list of names. 
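
The idea that an object can be identified by a subset of its attributes can be illustrated as follows. The first object uses the attribute values listed above; the second object and the function name find are invented for the example.

```python
# The first entry uses the attributes from the text; the second is hypothetical.
OBJECTS = [
    {"User": "Sunita", "File type": "Source", "Language": "C", "Name": "Test1"},
    {"User": "Sunita", "File type": "Source", "Language": "Java", "Name": "Test2"},
]


def find(objects, wanted):
    """Return every object whose attributes include all the wanted pairs."""
    return [obj for obj in objects
            if all(obj.get(key) == value for key, value in wanted.items())]


one = find(OBJECTS, {"Language": "C"})      # a single attribute is enough here
many = find(OBJECTS, {"User": "Sunita"})    # matches a group of objects
```

A query that matches several objects behaves like a multicast name, while a sufficiently specific query identifies one object.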

(e) Source Routing 

o In some scenarios, the namespace may have the same structure as the physical network.

o Such a namespace defines source routing names to identify a path through the network of
a distributed system.

o A typical example is the UNIX-to-UNIX Copy (UUCP) namespace that defines the namespace
in the format host-1!host-2!host-3!heresh.

o This style is called a source routing namespace because the route through the network is
specified at the source computer.


3.2 Namespace and Contexts 

o A namespace is defined as the set of names within a distributed system that complies with
the naming convention.

o Namespace of a distributed system can be classified as flat namespace and partitioned 
namespace. 

o Flat namespace is a simple space with names as character strings. 

o Flat names do not have any structure and hence they are not suitable for large systems. 

o Partitioned namespaces are desirable in situations involving a large set of objects.

o The partitioning is done syntactically and the name structure represents physical
association.

o Each of these partitions is called a domain of the namespace, which is a flat namespace
within itself.

o It is possible that two domains can have a common name defined within them.

o Not all objects can be identified uniquely by simple names, so compound names are
used.

o A compound name comprises one or more simple names separated by a delimiter such as a
special character (e.g. /, %, $, @, etc).

o For example, UNIX uses the ‘/’ delimiter.

o A typical example of a compound namespace is the hierarchical namespace.

o This is partitioned at various levels like an inverted tree structure.

o A context is an environment within which a name is valid.

o A context/name pair forms a qualified name that identifies an object uniquely.

o Contexts can represent the division of a namespace along geographical, organizational, or
functional boundaries.

o In a partitioned namespace, each domain is a context in which those names are valid.

o Following are features of context
= Names in a context can be generated regardless of what names exist in other contexts.
= The same name can occur in more than one context.
= Contexts can be nested where a qualified name consists of identifying a context and
subsequent sub-contexts within it.
= Fig 7.5 shows how contexts are nested.
= The qualified name of object O1 which is associated with context C3 is C1/C2/C3/O1.

Figure 7.5: Nested Context (contexts C1 to C3, objects Object 1 to Object N)
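
As a rough sketch, nested contexts can be modelled as nested dictionaries, and a qualified name is resolved by entering one context per component. The dictionary layout, the stored values, and the second O1 entry are assumptions made for the example.

```python
def resolve(root, qualified_name, delimiter="/"):
    """Walk nested contexts, one simple name per step."""
    node = root
    for part in qualified_name.split(delimiter):
        node = node[part]    # raises KeyError if the name is not valid in this context
    return node


# Contexts are modelled as nested dictionaries. The name O1 appears in two
# contexts and refers to a different object in each of them.
root = {
    "C1": {
        "O1": "another object",
        "C2": {"C3": {"O1": "object O1"}},
    }
}
```

Resolving C1/C2/C3/O1 and C1/O1 gives different objects, which shows why the context/name pair, and not the simple name alone, identifies an object.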


3.3 Name Server 
o A name server (process) is used to maintain information about named objects and enable
users to access this name information. It also binds the object’s name to its property (such
as location).
o Each name server will store information about a subset of objects in the distributed system.
o The authoritative name servers store the information about the object and the naming
service maintains these names.
o The partitioned namespace maintains information about each domain and not each and
every object, thus reducing the amount of configuration data required in each name server.
o As shown in Fig 7.6, the authoritative name servers maintain information as below:
= The root node stores information about only the location of the name servers which
branch out from the Root, D1, D2, and D3.
= Domain D1 node stores information about domains D4 and D5.
= Domain D1 node stores information about domains D4, D7 and D8

Figure 7.6: Name server in hierarchical namespace


3.4 Name Agent 


o Name agents act as an interface between the name servers and the clients.

o The major function of a name agent is to maintain information about all the existing name
servers in the system.

o Whenever a client requests a name service, the name agent transfers the user’s request to
an appropriate name server.

o The reply received from the name server is then forwarded to the client.

o Name agents can be classified as private and shared.

= A private name agent works for a single client and is structured as a set of subroutines
linked to a client program.

= A shared name agent is structured as a part of the OS kernel with system calls to invoke
the naming service operations; alternatively it can be accessed via IPC primitives.


3.5 Name Resolution 


o Name resolution is the process of mapping an object’s name to its properties, such as its
location.

o This process is carried out at the authoritative name servers because the object’s properties
are stored and maintained there.

o The initial step of the name resolution process is to locate the authoritative name server of
the object, followed by invoking operations for either reading or updating the object’s
properties.

o Within the distributed system, each name agent knows at least one name server.

o The name resolution process works as follows:
= A client contacts the name agent.
= The name agent contacts the known name server to locate the object.
= If the object is not located there, then this known name server contacts other name servers.

o In a partitioned namespace, the name resolution process involves traversing the context
chain till the specific authoritative name server is encountered.


Figure 7.7: Name Resolution in a partitioned namespace (the mapping direction runs from one context to the next level until the authoritative name server of the object is reached)

o As depicted in Fig 7.7, the name is first interpreted in the context to which it is associated.
o This interpretation returns one of two possible values:
= The name of the authoritative name server of the named object, in which case the name
resolution process ends.
= A new context in which to interpret the name, in which case name resolution
continues till the authoritative name server for that object is encountered.
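
The traversal of the context chain can be sketched as a loop in which each interpretation step returns either the authoritative name server or the next context. The table CONTEXTS, its entries, and the server name are hypothetical, and the sketch simplifies by interpreting the whole name at each step instead of consuming one component at a time.

```python
# Hypothetical context table: interpreting a name in a context yields either
# the authoritative name server or the next context in which to interpret it.
CONTEXTS = {
    "root": {"object-1": ("context", "context1")},
    "context1": {"object-1": ("context", "context3")},
    "context3": {"object-1": ("authority", "name-server-7")},
}


def find_authority(name, context="root"):
    """Follow the context chain until the authoritative name server is found."""
    visited = [context]
    while True:
        kind, value = CONTEXTS[context][name]
        if kind == "authority":
            return value, visited      # resolution ends here
        context = value                # continue in the new context
        visited.append(context)


server, chain = find_authority("object-1")
```

The length of the returned chain is the number of contexts consulted, which is the cost that name caches try to avoid.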


4. System oriented names 
4.1 Features of System Oriented Names

o These names are integers or bit strings, with values as large as 2^128 - 1.

o These names are also referred to as unique identifiers, since they are guaranteed to be
unique in both time and space.

o They do not change during their lifetime and they are never reused.

o In a naming system, these names are always uniform: they are of the same size irrespective
of the type or location of the object identified by the names.

o It is easy to perform operations like hashing, sorting, etc. on them.

o These names are hard to guess and hence prove useful in security-related
situations.

o These names are automatically generated by the system.


4.2 Types of System Oriented Names 


o System-oriented names can either be unstructured or structured.

o Unstructured names have a single set of integers or a bit string which uniquely identifies
the object but cannot give any other information about the object.

o Structured names contain an additional component which provides information about the
properties of the object identified by its name.

o Both these types of names are shown in Fig 7.8.

Figure 7.8: Types of system oriented names (unstructured name: set of integers or bit strings; structured name: node identifier followed by local unique identifier)


4.3 Approaches to Generate System Oriented Names 
1. Centralized Approach 
o The centralized approach is basically used for generating unstructured names and is
shown in Fig 7.9.

Figure 7.9: Centralized approach to system oriented names (a central global unique identifier generator on the central node; new objects on Node 1, Node 2, ..., Node N)

o This system incorporates a centralized global unique identifier generator that generates a
standard global unique identifier for each object in the system.

o Then, any method is used to partition this global namespace among local domains of each
node.

o Now each node may either bind or map the global identifier to the locally created object.

o The centralized global unique identifier generator may become a bottleneck for large
namespaces.

2. Distributed Approach 

o It forces the naming system to use structured object identifiers. 

o The hierarchical concatenation strategy is used to create global unique identifiers. First, 
each domain is given a unique identifier. 

o As shown in Fig 7.10, the global unique identifier for an object is created by concatenating
the unique identifier of the domain (component 1) with an identifier used within the
domain (component 2).

Figure 7.10: Hierarchical Concatenation Strategy (object ID within domain using timestamp, e.g. n1.t1 where t1 = timestamp; server-specific object ID)

o Each component in turn can consist of more than one subcomponent.
o The various techniques to generate unique identifiers are:
o Using timestamp
= Here, each node is treated as a domain of namespace and uses the clock timestamp as
a unique identifier within the node.
= The global identifier is in the form of a pair (Node-ID, timestamp).
o Using servers as a domain
= Each server is treated as a domain and hence, the server generates object identifiers.
= The global identifier takes the form of a pair (Server-ID, server-specific unique
identifier).
o The distributed approach has better efficiency and reliability as compared to the centralized
approach.
o But it also has a few drawbacks such as non-standard length of identifiers and lack of
transparency.
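
A minimal sketch of the timestamp technique is given below, assuming each node already has a unique Node-ID. The class name NodeUidGenerator is invented, and the guard against a repeated or backward-moving clock reading is an addition of this sketch, not something the text prescribes.

```python
import time


class NodeUidGenerator:
    """Builds global identifiers of the form (Node-ID, timestamp)."""

    def __init__(self, node_id):
        self.node_id = node_id     # component 1: identifier of the domain (node)
        self._last = 0

    def next_uid(self):
        # Component 2: a clock reading. Assumption of this sketch: if the clock
        # has not advanced since the last call, step past the last value so the
        # same timestamp is never issued twice on this node.
        now = time.time_ns()
        self._last = max(now, self._last + 1)
        return (self.node_id, self._last)


n1 = NodeUidGenerator("n1")
n2 = NodeUidGenerator("n2")
uids = [n1.next_uid() for _ in range(1000)] + [n2.next_uid() for _ in range(1000)]
```

No coordination between the nodes is needed: two nodes may produce the same timestamp, but the Node-ID component keeps the pairs distinct.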
Handling system crashes during name generation 
o A distributed system must provide fault-tolerance in the event of a crash and subsequent
loss of state information of the unique identifier generator.
o It is also possible that on recovery, the unique identifier generator may not function
correctly, thus resulting in non-unique identifiers.
o This problem can be resolved using either a clock that operates across failures or using
multiple levels of storage.
o Using a clock that operates across failures
= A clock can be used at the location of the unique identifier generator.
= It will ensure continuous operation in the event of failure.
= This method has just one drawback in that it may require a long identifier, whose length
depends on the granularity of the clock interval.
o Using multiple levels of storage
= This method uses multiple levels of storage and the unique identifiers are structured in
a hierarchical fashion, one field per level.
= A counter maintained at each level contains the highest value of the field assigned and
the current value is cached in the main memory.
= In case of a crash at one level, the next higher level is incremented while the low-level
counter is reset.
= Two levels with main memory and stable storage are sufficient for creating a safe and
efficient unique identifier generator.
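
The two-level scheme can be sketched as follows. StableStorage stands in for a disk that survives crashes, and both class names are hypothetical. The sketch increments the stable field on every start, which is a conservative simplification: it treats every restart as a recovery from a crash.

```python
class StableStorage:
    """Stands in for storage that survives crashes (for example, a disk)."""

    def __init__(self):
        self.high = 0


class UidGenerator:
    """Two-field identifiers: (stable-storage field, main-memory counter)."""

    def __init__(self, stable):
        # On every start, including recovery after a crash, the higher-level
        # field is incremented in stable storage and the low-level counter,
        # which lives only in main memory, starts again from zero.
        stable.high += 1
        self.high = stable.high
        self.low = 0

    def next_uid(self):
        self.low += 1
        return (self.high, self.low)


disk = StableStorage()
gen = UidGenerator(disk)
before_crash = [gen.next_uid(), gen.next_uid()]

gen = UidGenerator(disk)            # simulated crash: main-memory state is lost
after_crash = [gen.next_uid()]
```

Identifiers issued after the crash cannot collide with earlier ones because the stable field differs, even though the in-memory counter restarted.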


5. Object-Locating Mechanisms 
e The process of mapping an object’s system-oriented unique identifier (UID) to the replica 
locations of the object is called object-locating. 
e This process involves knowing the node where the object is located. 
e Once object-location has succeeded, the object access operation is initiated, which implies 
carrying out the desired operation of read or write on the object. 
1. Broadcast 


o In this method, the node with an object request broadcasts the request to all nodes in the
system.

o The node currently having the requested object replies to the requesting node, as shown
in Fig 7.11.

Figure 7.11: Object Locating Using Broadcast (1. Broadcast request message from the requesting node; 2. Reply from the node where object is located)

o This method is simple and reliable, since it supplies all replica locations of the object.

o But this method is neither efficient nor scalable, since it generates unnecessary traffic to the
nodes which do not have the object or its replica.

o Hence, this method is suitable for small systems and those systems having a low frequency
of object-locating requests.


2. Expanding Ring Broadcast 


o Expanding Ring Broadcast (ERB) is a modification of the pure broadcast method and it is
used for Wide Area Networks (WAN).

o The WAN is an internetwork that consists of Local Area Networks (LAN) connected
by gateways.

o In this method, increasingly distant LANs are searched till the object is found or till every
LAN has been searched unsuccessfully.

o Distance is measured in hops, where a hop corresponds to a gateway between two processors.
Processors in a single LAN are zero hops apart.

o A ring is a set of LANs at a specific distance from the requesting processor.

o Fig 7.12 depicts how the ERB works.

Figure 7.12: Object Locating Using ERB (1. Search at hop 1; 2. Search at hop 2 if 1 fails; 3. Search at hop 3 if 2 fails)

o Processor A wants to locate object X.
o The request is first broadcast only to the processors in the first ring.
o If a reply is received, the search ends successfully; else the search proceeds with a
broadcast to the next ring of LANs after a timeout period.
o This continues till we reach the ring in the last periphery of the internetwork.
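
The ring-by-ring search can be sketched as a breadth-first search over LANs, where crossing one gateway adds one hop. The data layout (links and holders dictionaries) is an assumption of this sketch, the search starts with the requester's own LAN at zero hops, and timeouts are not modelled: a ring simply "fails" when no LAN in it holds the object.

```python
def expanding_ring_search(links, holders, start, wanted):
    """Search LANs ring by ring, nearest ring first.

    links:   LAN -> list of neighbouring LANs (one gateway hop each)
    holders: LAN -> set of objects stored on that LAN
    Returns (lan, hops) for the nearest LAN holding `wanted`, or None.
    """
    ring, seen, hops = [start], {start}, 0
    while ring:
        for lan in ring:                     # broadcast within the current ring
            if wanted in holders.get(lan, set()):
                return lan, hops
        next_ring = []                       # no reply: move one hop further out
        for lan in ring:
            for neighbour in links.get(lan, []):
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_ring.append(neighbour)
        ring, hops = next_ring, hops + 1
    return None


links = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"]}
holders = {"D": {"X"}}
found = expanding_ring_search(links, holders, "A", "X")
```

LANs beyond the ring in which the object is found never receive the request, which is the saving over a pure broadcast.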
3. Encoding Object Location in UID
o The encoding mechanism of object location uses structured identifiers, one field of which
indicates the location of the object. From the UID, the requesting node extracts the object
location as shown in Fig 7.13.

Figure 7.13: Encoding object location in UID (the requesting node extracts the object location from the UID)

o This method is simple and efficient, but it does not support object mobility, because moving
the object would change its identifier.
o This method hence requires that the object be fixed to one node throughout its lifetime.
Also, this method does not directly support multiple replicas of the object.
o In short, this method of object location is limited to those distributed systems that do not
support object migration and object replication.
4. Encoding creator node ID in UID 
o This method is based on the principle of locality which states that an object is most likely 
to remain on the node where it was created. 
o This is because object migration is expensive and objects do not migrate frequently. 
o In this method, the location field of the structured UID contains the identifier of the node
on which the object was created.

o Given a UID, the creator node information is extracted from the UID and a request for the
object is sent to the creator node.

o In case the object has migrated, a failure reply is returned to the requesting node.

o Then the object is located by broadcasting the request from the client node, as shown in
Fig 7.14.

Figure 7.14: Encoding creator node ID in UID

o This method is used in Cronus.

o As compared to the broadcasting method, this method reduces network traffic.

5. Using Forward Location Pointers 

o A forward location pointer is stored at a node when the object is migrated to a new location.

o The system contacts the creator node of the object and follows the forward chain of 
pointers to the node where the object is currently located. 

o This method avoids excessive message transmission as in the broadcast method, and 
supports object migration facility. 

o One of the drawbacks of this method is that the cost of object location is directly 
proportional to the length of the chain, and the cost increases if the object migration is 
excessive. 


Figure 7.15: Using forward location pointers (1, 2, 3: message forwarding path from the requesting node; 4: reply from the node where the object is located)

o In case one of the intermediate pointers is lost or is unavailable due to node failure, it will
become difficult or impossible to locate the object.

o This method is shown in Fig 7.15.
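
Following a chain of forward location pointers can be sketched as below. The per-node state layout is hypothetical, a failed node is modelled as a node missing from the table, and the message count shows why the cost grows with the length of the chain.

```python
def locate(nodes, creator, uid):
    """Follow forward location pointers, starting at the creator node.

    nodes: node -> {"objects": set of UIDs held, "forward": {uid: next node}}
    Returns (node holding uid, messages sent), or (None, messages sent).
    """
    node, messages = creator, 1           # the first message goes to the creator node
    while node is not None:
        state = nodes.get(node)
        if state is None:                 # node has failed: the chain is broken
            return None, messages
        if uid in state["objects"]:
            return node, messages
        node = state["forward"].get(uid)  # pointer left behind on migration
        if node is not None:
            messages += 1
    return None, messages


nodes = {
    "n1": {"objects": set(), "forward": {"u": "n2"}},
    "n2": {"objects": set(), "forward": {"u": "n3"}},
    "n3": {"objects": {"u"}, "forward": {}},
}
result = locate(nodes, "n1", "u")

broken = {name: state for name, state in nodes.items() if name != "n2"}
lost = locate(broken, "n1", "u")
```

With n2 removed, the object on n3 can no longer be reached through the chain, which is the failure case described above.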

6. Using Hint Cache and Broadcasting. 

o This method maintains a cache on each node that contains a UID and the last known 
location of the recently accessed remote objects. 

o The requesting node first searches the local hint cache for the UID. 

o If the UID is found in the cache, the location information is extracted and the request is
sent to the node specified there; if it is not found, the request is broadcast.

o In the worst case, if the object is not on that node, the request is returned negative,
implying that the hint cache information is outdated.

o In this situation, the request message is broadcast to the entire network to locate the
current location of the object.

o Each node now searches internally for the requested object and if found, it returns a 
success message. 

o Since the earlier location information was outdated, this location is now written in the 
requesting node’s hint cache. 

o We call the cache entry a hint, since it may not always be correct.

o This method is shown in Fig 7.16.

Figure 7.16: Using hint cache and broadcasting (1. Search local hint cache at the requesting node; 2. Broadcast request message if local search fails; 3. Reply from node where object is located)

o One of the major advantages of this scheme is that it is efficient if a high degree of locality
is exhibited in locating objects from a node.
o One of the drawbacks is the overhead of maintaining the hint caches on all nodes.
o The second drawback is that the broadcast requests lead to excessive network traffic, even
though only one node is involved in an object-locating operation.
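
The hint-cache procedure can be sketched as follows. The data layout is hypothetical, and the broadcast is simulated by scanning a table of nodes. The hint is tried first, and a broadcast is used only on a miss or a stale hint, after which the cache is refreshed.

```python
def locate(uid, hint_cache, nodes):
    """Locate an object using the local hint cache, then broadcast if needed.

    hint_cache: UID -> last known node (may be stale)
    nodes:      node -> set of UIDs currently stored there
    Returns (node or None, how it was found).
    """
    hinted = hint_cache.get(uid)
    if hinted is not None and uid in nodes.get(hinted, set()):
        return hinted, "hint"              # the hint was still correct
    # Cache miss or outdated hint: ask every node (simulated broadcast).
    for node, objects in nodes.items():
        if uid in objects:
            hint_cache[uid] = node         # record the current location
            return node, "broadcast"
    hint_cache.pop(uid, None)              # object not found anywhere
    return None, "broadcast"


nodes = {"n1": set(), "n2": {"u"}}
cache = {"u": "n1"}                        # stale: the object has moved to n2
first = locate("u", cache, nodes)
second = locate("u", cache, nodes)
```

The first lookup pays for a broadcast because the hint is outdated; the second is answered from the refreshed hint.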


6. Issues in designing human oriented names 
6.1 Scheme for Global Object Naming 


e The schemes for global object naming include combining an object’s local name with its host
name, interlinking isolated namespaces into a single namespace, sharing a remote namespace on
explicit request, and providing a single global namespace of object naming.
1. Combining an object’s local name with its host name
o In this method, the namespace consists of several isolated namespaces and each of these
belongs to a node in the distributed system.

o The object name in this namespace uniquely identifies the object at the node. 

o This naming scheme is simple and easy to implement but is not location-transparent or 
location-independent. 

o Also, the object name would change on migration, making this naming scheme inflexible. 

2. Interlinking isolated namespaces into a single namespace 
o The second naming scheme is to interlink isolated namespaces into a single namespace. 

o The global namespace is composed of several isolated namespaces, but these are joined 
together to form a single naming structure. 

o In the naming hierarchy, the position of the component namespaces is arbitrary.

o This scheme does not have the notion of an absolute pathname; each pathname is relative 
to either the current working context or to the current component namespace. 

o The major advantage is that it is simple to join existing namespaces into a single global
namespace.

3. Sharing remote namespaces on explicit request

o This scheme attaches the isolated namespaces of various nodes to create a new 
namespace. 

o Here a single namespace is not created, but the users are given the choice of attaching a 
context of a remote namespace to one of the contexts of a local namespace. 

o Then the objects are named in a location-transparent manner. 

o This scheme allows some degree of sharing, and a request to share a remote namespace 
affects only the node from which the request was made. 

o This scheme has the advantage of providing flexibility to a user and is easy to implement. 

o But this scheme has the disadvantage in that it does not provide a consistent global 
namespace for all nodes. 

4. Providing a single global namespace of object naming

o A single global name structure spans all the objects of all the nodes in the distributed 
system. 

o The same namespace is hence visible to all users and an object’s absolute name is always 
the same irrespective of the object location or the user accessing it. 

o Thus, this scheme supports location independency (location of the accessing object or the 
location of the accessed object). 

o This is the most commonly used approach today because of its property of location 
transparency and location independency. 


6.2 Schemes for Partitioning a Namespace into Contexts 


The naming information consists of object names, attributes, and the bindings between them. 
Namespace management involves the storage and maintenance of this naming information. 
Managing this information at a centralized node is not reliable, while replicating it at every 
node will lead to management overheads. 

Hence, a distributed system would use many name servers, each storing a part of the naming 
information. 

The name servers interact among themselves and cater to the name resolution of users’ 
requests. 


6.3 Schemes for Implementing Context Bindings 


The context of a namespace is distributed among multiple name servers that manage the 
namespace. 

A name server hence stores only a small subset of all the contexts of the namespace. 

When a user requests a name resolution, a name server looks into its local context for an
authority attribute of an object.

If an authority attribute is not found in the local server, the name server searches for additional 
configuration data called context bindings. 

A context binding associates the context within which it is stored to another context and the 
name server that stores that context. 

Two strategies are commonly used to implement context bindings:
1. Table based strategy

o It is a commonly used approach for implementing context bindings of hierarchical
namespaces.

o In this method, each context is a table with two fields: the component name of the named
object, and either the context binding information or the authority attribute information.

o The amount of configuration data that must be stored in context objects at various levels
is proportional to the degree of hierarchical levels.

o The contexts of this strategy are also called directories.


2. Procedure-based strategy 


o In this method, a context binding is in the form of a procedure which, when executed, gives
information about the next context to be referred for the named object.

o The context bindings cannot be changed dynamically and independently for each object
and hence this scheme is inflexible.


6.4 Schemes for Name Resolution 


o Name resolution is the process of mapping an object’s name to the authoritative name
servers of the object.

o This process is a traversal of the resolution chain of contexts till the authority attribute of
the named object is reached.

o Name resolution depends on the policy used for distributing the contexts of namespaces.

o The commonly used name resolution mechanisms are:


1. Centralized Approach 


o In this method, a single name server is used for name resolution of the entire distributed
system and this is located in the centralized node.

o This server stores all the contexts and maintains the name mapping information of the
entire namespace.

o The location of the name server is known to all nodes.

o Hence, a request for name resolution is sent to this centralized name server’s node, which
resolves the name by traversing the complete chain of contexts and returns the attributes
of the named object to the requesting node.

o Domain Namespace System is a typical example of the centralized approach.


2. Fully replicated approach 


o This method uses a name server on each node and all contexts are replicated on these name
servers.

o This implies that the namespace database is fully replicated and hence all requests for name
resolution are serviced locally by object names.

o This method is simple, efficient, and avoids inter-node communication for name resolution.

o However, this method has a disadvantage in that a large overhead is required to maintain
consistency of naming information.

o This method is hence not suitable for large namespaces.

o The Pup name service and the Distributed Double-Loop Computer network Name service
use this approach.

3. Distribution based on the physical structure of namespace

o The hierarchical namespace is based on this approach. 

o In this method, the namespace tree is subdivided into many subtrees that are named 
differently in different systems. 

o These subtrees are referred to as zones.

o These are called domains in Sprite and file groups in LOCUS. 

o This method uses many name servers and each name server provides storage for one or 
more zones. 

o Contexts belonging to the same zone can be grouped together. 

o Zones provide indivisible units of storage and replication regarding named objects. 

o The name resolution process involves sending the request to the server that manages the
subtree to which the last component of the name belongs.

4. Structure-free distribution of contexts 

o In this method, any context of any namespace can be freely stored or moved at any name
server independently of any other context. Here each context belongs to its own zone.

o This method permits flexibility and simplifies name management, since the name servers
need not agree on what zones make up the namespace.

o A policy is required for partitioning and distribution of contexts.

o Name resolution is carried out by traversing the complete resolution chain of contexts on 
a context-by-context basis till the object’s authoritative name server is encountered. 

o Each context can be freely put on any server, so name resolution means hopping from one 
node to another if the servers involved are located on different nodes. 


7. Name caches 


7.1 Characteristics of Name Service Activities 
1. High degree of locality of name lookup 
o This is similar to the property of locality of reference observed in program execution, file
access, database access, etc.
o A high degree of locality exists in using pathnames for accessing objects.
o The hit ratio can be increased by providing a reasonable size name cache that caches
recently used naming information.
2. Slow update of name information database
o Users are mostly confined to the namespace which is only a small subset of the slowly
changing name information database.
o Also, the naming data has a high read-to-modify ratio.
o This means that the cost of maintaining the consistency of cached data is relatively low.
3. On-use consistency of cached information is possible
o In case obsolete naming data is accessed, it will not work, and hence the problem can be
attended to at the time of use.
o So name consistency can be maintained by detecting and discarding stale cache entries on
use.
o Thus, there is no need to invalidate all related cache entries when a naming update occurs,
and stale data does not cause a name to be mapped to a wrong object.
7.2 Issues in Name Cache Design
e There are various issues involved in name cache design:
o Types of name caches 
o Approaches to name cache implementation 
o Multi-cache consistency 
1. Types of Name caches 
o A name cache stores the name resolution information about recently accessed objects’ 
names. 
o Name caches can be categorized depending on the type of information stored in each entry. 
o Directory cache: 
= In this type of cache, each entry consists of a directory page and it is mostly used in the 
iterative method of name resolution. 
= All recently used directory pages that are brought to the client node during name 
resolution are cached for some time. 
o Prefix cache 
= This type of cache is basically used in systems with zone-based context distribution 
mechanism. 
= Each entry consists of a name prefix and the identifier of the corresponding zone.
= This type of cache is not suitable in those systems that use structure-free context
distribution mechanisms.
o Full-name cache
= Each entry in this type of cache consists of an object’s full path name and the identifier
and the location of its authoritative name server.
= Hence the requests for accessing an object that is available in the local cache can be
sent directly to the authoritative name server. 
= The full name cache is suitable for all naming mechanisms. 
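
A prefix cache lookup can be sketched as below. The text does not say how a cached prefix is selected; this sketch assumes the longest cached prefix of the path is used, and the cache contents, zone identifiers, and function name are invented.

```python
def longest_prefix(prefix_cache, path):
    """Return (prefix, zone_id) for the longest cached prefix of path, or None."""
    parts = path.strip("/").split("/")
    for end in range(len(parts), 0, -1):       # try the longest prefix first
        prefix = "/" + "/".join(parts[:end])
        if prefix in prefix_cache:
            return prefix, prefix_cache[prefix]
    return None


# Hypothetical cache contents: name prefix -> zone identifier.
prefix_cache = {"/a": "zone-1", "/a/b": "zone-2"}
hit = longest_prefix(prefix_cache, "/a/b/c.txt")
```

On a hit, the request can go straight to the name server of the returned zone instead of starting the resolution from the root.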
2. Approaches to name cache implementation 
o Aname cache can be implemented using two approaches: 
=" a cache-per-process 
=" a single cache for all processes 
o In the first approach, each cache is maintained per process and in the process’ address 
space. 
o Since it is small in size, information can be accessed faster and the OS memory area does 
not remain occupied with this information. 
o This name cache is deleted once the process is terminated and for every new process, a 
new name cache is created. 
o In case the process is short-lived, the hit ratio will be low due to start-up misses that occur
when a new name cache is created.
o It is also possible that the same naming information may be duplicated in many caches on
the same node.
o Start-up misses are reduced in the V-system by enabling the child process to inherit the
initial cache contents of the parent process.
o The other approach is to maintain a single name cache at each node for all the processes at
that node.
o These caches are hence larger in size as compared to per-process name caches and are
located in the OS’s memory.
o The cache information is long-lived, since entries are removed only by a standard replacement
policy when space is needed, leading to a higher hit ratio.
o This method also avoids the problem of duplicating the naming information. 
3. Multi Cache Consistency 
o Whenever naming data is updated, the related name cache entries become stale and
they need to be either updated or invalidated.
o Two commonly used methods to maintain multi-cache consistency are immediate
invalidate and on-use update.
o Immediate invalidate
= As the name suggests, the cached naming information is invalidated immediately when a
naming update occurs.
= It is done in two ways: broadcast method and multicast method.
= In the first approach, the invalidation message is broadcast to all nodes in the system.
= Each node examines its own name caches and invalidates the cache entry if it has that data.
= This method increases network traffic overhead.
= In the second approach, a storage node keeps, for each piece of naming data, a list of the
nodes on which that data is cached.
= When a storage node receives an update request, it checks its list and notifies the
corresponding nodes to invalidate their cache entries.
o On-use update
= It is a commonly used method for maintaining naming consistency.
= When a client uses a stale entry from a cache, it is informed by the naming system that
the data being used is stale.
= Once this message is received, the updated data is obtained, using either broadcast or
multicast, and used to refresh the stale cache entry.
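
The on-use update method can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class names `NameServer` and `NameCache`, the use of a version number to detect staleness, and a direct lookup standing in for the broadcast or multicast step.

```python
class NameServer:
    """Hypothetical authoritative store: name -> (location, version)."""

    def __init__(self):
        self.table = {}

    def bind(self, name, location):
        # Every naming update bumps the version of the binding.
        version = self.table.get(name, (None, 0))[1] + 1
        self.table[name] = (location, version)

    def lookup(self, name):
        return self.table[name]

    def access(self, name, version):
        """Return the location, or None if the caller's entry is stale."""
        location, current = self.table[name]
        return location if version == current else None


class NameCache:
    """Client-side cache that is refreshed only when a stale entry is used."""

    def __init__(self, server):
        self.server = server
        self.entries = {}
        self.refreshes = 0

    def resolve(self, name):
        if name not in self.entries:
            self.entries[name] = self.server.lookup(name)
        location, version = self.entries[name]
        if self.server.access(name, version) is None:
            # Stale on use: discard the entry and fetch the updated data.
            self.entries[name] = self.server.lookup(name)
            self.refreshes += 1
            location, version = self.entries[name]
        return location


server = NameServer()
server.bind("/docs/a", "node1")
cache = NameCache(server)
first = cache.resolve("/docs/a")     # cached from node1
server.bind("/docs/a", "node2")      # naming update, no invalidation sent
second = cache.resolve("/docs/a")    # staleness detected on use
```

Note that the naming update itself sends no message to any cache; the cost is paid only by the client that actually uses the stale entry.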


8. Naming and security 
e Naming-related access control mechanisms include using object names as protection keys,
capabilities, and associating protection with the name resolution path.


1. Object names as protection keys 


o An object’s name is used as a protection key for the object. 

o A user who knows the name of an object can access the object.

o An object can have several names, meaning several keys and any one of the keys can be 
used to access the object. 

o Users are not allowed to name objects that they are not authorized to access. Also,
it is not possible for a user to access another user’s objects.
2. Capabilities
Figure7.17: Capability (two fields: object identifier and rights information)
o A capability is a special type of identifier that contains additional information for protection:
it allows its holder to access the object it identifies in one or more of the permission modes.
o Capabilities have several properties:
o A capability is a system-oriented name that uniquely identifies an object.
o It also protects the object it references by defining the operations that may be performed on
the object it identifies.
o A client that possesses a capability can access the object that is identified by it in the modes
allowed by it.
o An object can have several capabilities, and the same capability held by different holders
provides the same access rights to all of them.
o An object is shareable among clients if all of them have capabilities for the object.
o Capabilities are protected objects that are maintained by the OS and are indirectly accessed
by the users.
o To perform an operation on an object, a process sends a message to the name server
containing the object’s capability.
o The name server verifies this information and, if it is valid, allows the type of
operation requested by the client on the relevant object.
o Otherwise, a ‘permission denied’ message is sent to the client.
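
The check that a server makes on a capability can be sketched as follows. This is only an illustration under stated assumptions: the `Capability` class and the `perform` function are hypothetical names, and the sketch does not model how a real system keeps capabilities from being forged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    """Hypothetical capability: an object identifier plus rights information."""
    object_id: str
    rights: frozenset


def perform(capability, operation):
    # The operation is allowed only in the modes the capability grants.
    if operation in capability.rights:
        return f"{operation} allowed on {capability.object_id}"
    return "permission denied"


read_only = Capability("file42", frozenset({"read"}))
print(perform(read_only, "read"))
print(perform(read_only, "write"))
```

Any client holding `read_only` gets exactly the same rights, which matches the property that a capability grants the same access to all of its holders.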


3. Associating protection with name resolution path 


o Protection can be associated either with the object itself or with the name resolution path
used to identify the object.
o The name resolution path method uses access control list (ACL) based protection, which
controls access based on user identity.
o The user is identified by a trusted identifier, which may be a password, an address, or any
other form of identifier that cannot be forged.
o An ACL specifies the user names and the types of access allowed for each user of that object.
o When a user requests an object, the OS checks the ACL for that object, and access is allowed
only if the user is listed.
o If not, a protection violation occurs.
o This check is done for both named objects and the information in the naming database.
o When a name server receives a request for a directory, it first verifies whether access is
allowed or not.
o Name servers thus do not provide access to unauthorized clients and do not accept
unauthorized updates to the naming information stored in the naming database.
o To access an object, a user must be allowed access both to the directories of the
object’s pathname and to the object itself.
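
The last rule, that access is needed to every directory on the pathname as well as to the object, can be sketched as below. The ACL contents, the user names, and the choice of "read" as the right required on a directory are assumptions made for the example.

```python
# Hypothetical ACLs: path -> {user: set of allowed access types}
ACLS = {
    "/": {"alice": {"read"}, "bob": {"read"}},
    "/projects": {"alice": {"read"}},
    "/projects/report.txt": {"alice": {"read", "write"}, "bob": {"read"}},
}


def path_prefixes(pathname):
    """Return the root, every directory on the path, and the object itself."""
    parts = [p for p in pathname.split("/") if p]
    return ["/"] + ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def access_allowed(user, pathname, mode):
    prefixes = path_prefixes(pathname)
    # Every directory along the name resolution path must admit the user.
    for directory in prefixes[:-1]:
        if "read" not in ACLS.get(directory, {}).get(user, set()):
            return False
    # Finally the ACL of the object itself must allow the requested mode.
    return mode in ACLS.get(pathname, {}).get(user, set())
```

In this data set `bob` is listed on the object but not on `/projects`, so his request fails during name resolution and never reaches the object's own ACL.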
9. DNS
e The Domain Name System (DNS) is widely used to access the Internet.
Figure7.18: Using DNS (a DNS client asks a DNS server for the address of host-a.example.microsoft.com,
and the server answers 192.172.1.20 from its DNS server data)

e It is an Internet directory service that provides a way to map the user-friendly name of a 
computer or service to its numeric address. 

e DNS defines how domain names are translated into IP addresses, and it also
controls e-mail delivery.
e A client computer queries a DNS server, asking for the IP address of a computer configured to
use host-a.example.microsoft.com as its DNS domain name.
e The DNS server answers the query based on its local database and replies with an answer
containing the requested information, that is, the IP address for
host-a.example.microsoft.com.

e The DNS system consists of three components: 

o DNS data called resource records 
o Servers called name servers 
o Internet protocols for fetching data from the servers 

e The billions of resource records on the Internet are split into millions of files called zones. 

e Zones are kept on authoritative servers distributed all over the Internet that answer queries 
based on the resource records stored in the zones they have copies of. 

e To standardize DNS naming across the Internet, different organizations were assigned
authoritative control of top-level domains. The original seven top-level domains are:
o com (commercial organizations)
o edu (educational organizations)
o gov (government organizations)
o mil (military organizations)
o net (networking organizations)
o org (noncommercial organizations)
o int (international organizations)
e A domain name usually consists of two or more parts and is conventionally written separated
by dots, such as example.com.
Figure7.19: DNS topology

e The rightmost label conveys the top-level domain (for example, the address www.example.com 
has the top-level domain com). 

e Each label to the left specifies a subdivision or the sub-domain of the domain above it. 

e The DNS is maintained by a distributed database system that uses the client-server model. The 
nodes of this database are the name servers. 

e Each domain or sub-domain has one or more authoritative DNS servers that publish 
information about that domain and the name servers of any domains subordinate to it. 

e The top of the hierarchy is served by the root name servers: the servers to query when looking
up (resolving) a top-level domain name (TLD).

e For example, a DNS recursor consults three name servers to resolve the address 
www.wikipedia.org. 

e Most name servers are authoritative for some zones and perform a caching function for all 
other DNS information. 

e Large name servers are often authoritative for tens of thousands of zones, but most name 
servers are authoritative for just a few zones. 
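
The walk down the hierarchy, in which resolving www.wikipedia.org involves three name servers, can be imitated with a toy model. The server names, the table layout, and the address 192.0.2.10 (taken from a range reserved for documentation) are all assumptions; no real DNS query is made.

```python
# Hypothetical zone data: a server either refers to a subordinate server
# or holds the answer itself.
SERVERS = {
    "root": {"referrals": {"org": "org-server"}},
    "org-server": {"referrals": {"wikipedia.org": "wikipedia-server"}},
    "wikipedia-server": {"answers": {"www.wikipedia.org": "192.0.2.10"}},
}


def resolve(name):
    """Follow referrals from the root until a server can answer."""
    server, consulted = "root", []
    while True:
        consulted.append(server)
        data = SERVERS[server]
        if name in data.get("answers", {}):
            return data["answers"][name], consulted
        labels = name.split(".")
        suffixes = [".".join(labels[i:]) for i in range(len(labels))]
        referrals = data.get("referrals", {})
        match = next((s for s in suffixes if s in referrals), None)
        if match is None:
            raise KeyError(f"{server} cannot resolve {name}")
        server = referrals[match]


address, consulted = resolve("www.wikipedia.org")
```

After the call, `consulted` lists the root server, the server for org, and the server for wikipedia.org, in that order.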
1. Architecture
1.1 Traditional Web Based system 

e Many Web-based systems are organized as simple client-server architectures. 

e The core of a Web site is formed by a process that has access to a local file system storing 
documents. 

e The simplest way to refer to a document is by means of a reference called a Uniform Resource 
Locator (URL). 

e It specifies where a document is located, often by embedding the DNS name of its associated 
server, along with a file name by which the server can look up the document in its local file 
system. 

e A URL also specifies the application-level protocol for transferring the document across the
network.
Figure8.1: The overall organization of a traditional Web site. (1) The client machine sends a
get-document request (HTTP); (2) the Web server on the server machine fetches the document from a
local file; (3) the server returns the response.


e A client interacts with Web servers through a special application known as a browser.
e A browser is responsible for properly displaying a document. 
e A browser accepts input from a user mostly by letting the user select a reference to another 
document, which it then subsequently fetches and displays. 
e The communication between a browser and Web server is standardized: they both adhere to
the HyperText Transfer Protocol (HTTP).
e This leads to the overall organization shown in Fig. 8.1. 
Web Documents 
o Fundamental to the Web is that all information comes in the form of a document. 
o Most documents can be divided into two parts:
= A main part that acts as a template.
= A second part that consists of many different bits and pieces that jointly constitute the
document that is displayed in a browser.
o The main part is generally written in a markup language.
o The most widely used markup language in the Web is HTML (HyperText Markup
Language).
o Another important markup language is the Extensible Markup Language (XML).
o HTML and XML can also include all kinds of tags that refer to embedded documents,
that is, references to files that should be included to make a document complete.
o Each embedded document has an associated MIME (Multipurpose Internet Mail
Extensions) type.
o MIME was developed to provide information on the content of a message body that was sent
as part of electronic mail.

o MIME makes a distinction between top-level types and subtypes. 


Type        | Subtype      | Description
Text        | Plain        | Unformatted text
Text        | HTML         | Text including HTML markup commands
Text        | XML          | Text including XML markup commands
Image       | GIF          | Still image in GIF format
Image       | JPEG         | Still image in JPEG format
Audio       | Basic        | Audio, 8-bit PCM sampled at 8000 Hz
Audio       | Tone         | A specific audible tone
Video       | MPEG         | Movie in MPEG format
Video       | Pointer      | Representation of a pointer device for presentations
Application | Octet-stream | An uninterpreted byte sequence
Application | Postscript   | A printable document in Postscript
Application | PDF          | A printable document in PDF
Multipart   | Mixed        | Independent parts in the specified order
Multipart   | Parallel     | Parts must be viewed simultaneously

Table 8.1: Six top-level MIME types and some common subtypes

o Some common top-level types are shown in Table 8.1 and include types for text, image, 
audio, and video. 

o There is a special application type that indicates that the document contains data that are 
related to a specific application. 
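
As a small illustration of the type/subtype distinction, the following sketch splits a MIME type into its two parts and asks Python's standard `mimetypes` module to guess the type of a file from its name. The helper `split_mime` and the file name are assumptions made for the example.

```python
import mimetypes


def split_mime(mime_type):
    """Split 'type/subtype' into its top-level type and subtype."""
    top_level, subtype = mime_type.split("/", 1)
    return top_level, subtype


# guess_type returns a (type, encoding) pair based on the file extension.
guessed, _encoding = mimetypes.guess_type("index.html")
print(guessed, split_mime(guessed))
print(split_mime("application/octet-stream"))
```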

2. Multitiered Architectures 

o Common Gateway Interface or simply CGI defines a standard way by which a Web server 
can execute a program taking user data as input. 

o Usually, user data come from an HTML form; it specifies the program that is to be 
executed at the server side, along with parameter values that are filled in by the user. 

Figure 8.2: The principle of using server-side CGI programs. (Labels recoverable from the figure:
1. get request, 3. start process to fetch document, 4. database interaction, 5. HTML document
created, 6. return result. The Web server contains an HTTP request handler, the CGI process runs
the CGI program, and the data resides on a database server.)


o Once the form has been completed, the program's name and collected parameter values
are sent to the server, as shown in Fig. 8.2.
o When the server sees the request, it starts the program named in the request and passes it
the parameter values.
o The program simply does its work and generally returns the results in the form of a document
that is sent back to the user's browser to be displayed.
o As server-side processing of Web documents increasingly requires more flexibility, it
should come as no surprise that many Web sites are now organized as a three-tiered
architecture consisting of a Web server, an application server, and a database.
o The Web server is the traditional Web server; the application server runs all kinds of
programs that may or may not access the third tier, consisting of a database.
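
The CGI flow described above (the request names a program, the server passes it the form values, and the program returns a document) can be sketched in a few lines. This is a simplification under stated assumptions: the program path `/cgi-bin/greet` and the function names are hypothetical, and the program runs inside the handler instead of being started as a separate process as a real CGI program would be.

```python
from html import escape
from urllib.parse import parse_qs


def greet_program(params):
    """Hypothetical server-side program: builds an HTML document from form values."""
    name = params.get("name", ["stranger"])[0]
    return f"<html><body><p>Hello, {escape(name)}!</p></body></html>"


# Programs the server is willing to run, keyed by the path named in the request.
PROGRAMS = {"/cgi-bin/greet": greet_program}


def handle_request(path, query_string):
    program = PROGRAMS.get(path)
    if program is None:
        return 404, "<html><body>Not Found</body></html>"
    # Pass the parameter values filled in by the user to the program.
    return 200, program(parse_qs(query_string))


status, document = handle_request("/cgi-bin/greet", "name=Ada")
```

In a three-tiered setup, `greet_program` would be the place where the application server queries the database before producing the document.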


1.2 Web Services 


Figure8.3: The principle of a Web service. (On the client machine, a client application with a stub
looks up a service; on the server machine, a server application with a stub publishes the service.
Both stubs are generated from the WSDL description, the two communication subsystems exchange
SOAP messages, and the directory service (UDDI) holds the published descriptions.)

o The principle of providing and using a Web service is quite simple, and is shown in Fig. 8.3.
o The basic idea is that some client application can call upon the services as provided by a
server application.

o Standardization takes place with respect to how those services are described such that 
they can be looked up by a client application. 

o In addition, we need to ensure that service call proceeds along the rules set by the server 
application. 

o An important component in the Web services architecture is formed by a directory service
storing service descriptions.
o The Universal Description, Discovery and Integration standard (UDDI) prescribes the layout of
a database containing service descriptions that will allow Web service clients to browse
for relevant services.
o Services are described by means of the Web Services Description Language (WSDL).

o A WSDL description contains the precise definitions of the interfaces provided by a 
service, that is, procedure specification, data types, the (logical) location of services, etc. 

o A core element of a Web service is the specification of how communication takes place.
o The Simple Object Access Protocol (SOAP) is used, which is essentially a framework in
which much of the communication between two processes can be standardized.
2. Processes
2.1 Client 
e The most important Web client is a piece of software called a Web browser, which enables a
user to navigate through Web pages by fetching those pages from servers and subsequently
displaying them on the user's screen.
Figure8.4: The logical components of a Web browser (among them the user interface, the browser
engine, the rendering engine, and an interpreter).
e A Web browser consists of several components, as shown in Fig. 8.4.

e The core of a browser is formed by the browser engine and the rendering engine. 

e The rendering engine contains all the code for properly displaying documents. 

e Rendering requires parsing HTML or XML, and may also require script interpretation.

e In most cases, there is only an interpreter for JavaScript included, but in theory other
interpreters may be included as well.

e The browser engine provides the mechanisms for an end user to go over a document, select 
parts of it, activate hyperlinks, etc. 

e Another client-side process is a Web proxy. 


Figure8.5: Using a Web proxy when the browser does not speak FTP. (The browser talks HTTP to the
Web proxy, which sends an FTP request to the FTP server and returns the FTP response to the
browser as an HTTP response.)
e Originally, such a process was used to allow a browser to handle application-level protocols 
other than HTTP, as shown in Fig 8.5. 
e For example, to transfer a file from an FTP server, the browser can issue an HTTP request to a 
local FTP proxy, which will then fetch the file and return it embedded as HTTP. 


2.2 The Apache Web Server 


e Apache relies on a runtime environment, known as the Apache Portable Runtime (APR), a library
that provides a platform-independent interface for functions such as file handling, networking,
locking, and threads.
Figure8.6: The general organization of the Apache Web server.
e Its overall organization is shown in Fig. 8.6.
e Fundamental to this organization is the concept of a hook, which is nothing but a placeholder
for a specific group of functions.
e The Apache core assumes that requests are processed in a number of phases, each phase
consisting of a few hooks.
e Each hook thus represents a group of similar actions that need to be executed as part of
processing a request.
e For example, there is a hook to translate a URL to a local file name.
e Such a translation will almost certainly need to be done when processing a request.
e Likewise, there is a hook for writing information to a log, a hook for checking a client's
identification, a hook for checking access rights, and a hook for checking which MIME type the
request is related to.
e As shown in Fig. 8.6, the hooks are processed in a predetermined order.
e The functions associated with a hook are all provided by separate modules.
e A module developer will write functions for specific hooks.
e When compiling Apache, the developer specifies which function should be added to which
hook.
e The latter is shown in Fig. 8.6 as the various links between functions and hooks.
e As there may be tens of modules, each hook will generally contain several functions.
e Modules are considered to be mutually independent, so that functions in the same hook will be
executed in some arbitrary order.
e Apache can also handle module dependencies by letting a developer specify an ordering in
which functions from different modules should be processed.
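
The hook idea can be sketched independently of Apache itself. The code below is not Apache's API: the class `Core`, the hook names, and the three module functions are hypothetical, and serve only to show hooks being processed in a fixed order while modules register their functions in any order.

```python
from collections import defaultdict

# Hypothetical hook names, processed in this predetermined order.
HOOK_ORDER = ["translate_url", "check_mime_type", "write_log"]


class Core:
    def __init__(self, hook_order):
        self.hook_order = hook_order
        self.hooks = defaultdict(list)   # hook name -> functions from modules

    def register(self, hook, function):
        self.hooks[hook].append(function)

    def process(self, request):
        # Phases run in hook order, regardless of registration order.
        for hook in self.hook_order:
            for function in self.hooks[hook]:
                function(request)
        return request


def translate(request):
    request["file"] = "/var/www" + request["url"]


def set_type(request):
    is_html = request["file"].endswith(".html")
    request["mime"] = "text/html" if is_html else "application/octet-stream"


def log(request):
    request.setdefault("log", []).append(f"served {request['file']}")


core = Core(HOOK_ORDER)
core.register("write_log", log)            # registered first, still runs last
core.register("check_mime_type", set_type)
core.register("translate_url", translate)
result = core.process({"url": "/index.html"})
```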




2.3 Web Server Clusters 


e A Web server can easily become overloaded.
e A solution is to replicate the server on a cluster of servers and use a separate mechanism,
such as a front end, to redirect client requests to one of the replicas.
Figure 8.7: The principle of using a server cluster in combination with a front end to implement a Web service.
(Several Web servers share a LAN with a front end that handles all incoming requests and outgoing responses.)


e This principle is shown in Fig 8.7, and is an example of horizontal distribution. 

e Here the front end can become a serious performance bottleneck, so its design is crucial.

e Hence a distinction is made between front ends operating as transport layer switches, and 
those that operate at the level of the application layer. 

e Whenever a client issues an HTTP request, it sets up a TCP connection to the server. 

e A transport-layer switch simply passes the data sent along the TCP connection to one of the 
servers, depending on some measurement of the server's load. 

e The response from that server is returned to the switch, which will then forward it to the 
requesting client. 

e The main drawback of a transport-layer switch is that the switch cannot take into account the 
content of the HTTP request that is sent along the TCP connection. 
Figure8.8: A scalable content-aware cluster of Web servers.

e At best, it can only base its redirection decisions on server loads. 

e A better approach is to deploy content-aware request distribution, by which the front end 
first inspects an incoming HTTP request, and then decides which server it should forward that 
request to. 

e A problem with content-aware distribution is that the front end needs to do a lot of work. 

e In combination with TCP handoff, the front end has two tasks. 

e First, when a request initially comes in, it must decide which server will handle the rest of the 
communication with the client. 

e Second, the front end should forward the client's TCP messages associated with the
handed-off TCP connection.

e These two tasks can be distributed as shown in Fig 8.8. 

e The dispatcher is responsible for deciding to which server a TCP connection should be handed 
off; a distributor monitors incoming TCP traffic for a handed-off connection. 

e The switch is used to forward TCP messages to a distributor. 

e When a client first contacts the Web service, its TCP connection setup message is forwarded 
to a distributor, which in turn contacts the dispatcher to let it decide to which server the 
connection should be handed off. 

e At that point, the switch is notified that it should send all further TCP messages for that 
connection to the selected server. 
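
The difference between the two kinds of front end can be sketched as two selection functions. The server names, the load figures, and the rule of hashing the requested path are assumptions chosen for illustration; the text does not prescribe a particular policy for content-aware distribution.

```python
import zlib

SERVERS = ["server-0", "server-1", "server-2"]   # hypothetical replica names


def transport_layer_choice(loads):
    """Pick the least-loaded server; the HTTP request is never inspected."""
    return min(loads, key=loads.get)


def content_aware_choice(request_line, servers=SERVERS):
    """Inspect the HTTP request line and map the requested path to a server."""
    _method, path, _version = request_line.split()
    # A stable hash sends every request for the same document to one server.
    return servers[zlib.crc32(path.encode()) % len(servers)]


print(transport_layer_choice({"server-0": 5, "server-1": 2, "server-2": 9}))
print(content_aware_choice("GET /index.html HTTP/1.1"))
```

In the terms used above, `content_aware_choice` plays the role of the dispatcher's decision; forwarding the remaining TCP messages of the handed-off connection is not modelled.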


3. Communication 
3.1 Hypertext Transfer Protocol 
e All communication in the Web between clients and servers is based on the Hypertext Transfer 
Protocol (HTTP). 
e HTTP is a relatively simple client-server protocol: a client sends a request message to a server 
and waits for a response message. 
e It does not have any concept of open connection and does not require a server to maintain 
information on its clients. 
1. HTTP Connections
o HTTP is based on TCP. 
o Whenever a client issues a request to a server, it first sets up a TCP connection to the 
server and then sends its request message on that connection. 
o The same connection is used for receiving the response. 
o By using TCP as its underlying protocol, HTTP need not be concerned about lost requests
and responses.
Figure8.9: (a) Using nonpersistent connections. (b) Using persistent connections.
o In HTTP version 1.0 and older, each request to a server required setting up a separate 
connection, as shown in Fig. 8.9(a). 
o When the server had responded, the connection was broken down again.
o Such connections are referred to as being nonpersistent.
o A major drawback of nonpersistent connections is that it is relatively costly to set up a TCP
connection.
o Another approach, followed in HTTP version 1.1, is to make use of a persistent
connection, as shown in Fig. 8.9(b), which can be used to issue several requests without the
need for a separate connection for each (request, response) pair.
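
A rough cost model makes the drawback of nonpersistent connections concrete. The model and its numbers are assumptions: it charges one fixed setup cost per TCP connection and one fixed cost per request/response exchange, and ignores everything else.

```python
def total_time(requests, setup_cost, exchange_cost, persistent):
    """Total cost of fetching `requests` documents from one server."""
    if persistent:
        connections = min(1, requests)   # one connection reused for all requests
    else:
        connections = requests           # a fresh connection per request
    return connections * setup_cost + requests * exchange_cost


# Ten documents, setup cost 3 units, each exchange 1 unit (illustrative values).
print(total_time(10, 3, 1, persistent=False))
print(total_time(10, 3, 1, persistent=True))
```

Under this model the saving grows with the number of requests, since the setup cost is paid once instead of once per document.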

2. HTTP Methods

o A client can request an operation to be carried out at the server by sending a
request message that names the desired operation.

o A list of the most commonly used request messages is given in Table 8.2.

Operation | Description
Head      | Request to return the header of a document
Get       | Request to return a document to the client
Put       | Request to store a document
Post      | Provide data that are to be added to a document (collection)
Delete    | Request to delete a document

Table8.2: Operations supported by HTTP.
3. HTTP Messages
o All communication between a client and server takes place through messages.
o HTTP recognizes only request and response messages.
Figure8.10(a): HTTP request message (a request line holding the operation, reference, and version
fields, followed by request message headers as name/value pairs and a message body; the fields are
separated by delimiters).
o A request message consists of three parts, as shown in Fig. 8.10(a).
o The request line is mandatory and identifies the operation that the client wants the server 
to carry out along with a reference to the document associated with that request. 
o A separate field is used to identify the version of HTTP the client is expecting. 
o A response message starts with a status line containing a version number and also a
three-digit status code, as shown in Fig. 8.10(b).
Figure 8.10(b): HTTP response message (a status line holding the version, status code, and phrase
fields, followed by response message headers as name/value pairs and a message body).
o For example, status code 200 indicates that a request could be serviced, and has the
associated phrase "OK."
o Other frequently used codes are: 400 (Bad Request), 403 (Forbidden), 404 (Not Found).
o Table 8.3 shows a number of valid message headers that can be sent along with a request 
or response. 


Header              | Source | Contents
Accept              | Client | The type of documents the client can handle
Accept-Charset      | Client | The character sets that are acceptable for the client
Accept-Encoding     | Client | The document encodings the client can handle
Accept-Language     | Client | The natural language the client can handle
Authorization       | Client | A list of the client's credentials
WWW-Authenticate    | Server | Security challenge the client should respond to
Date                | Both   | Date and time the message was sent
ETag                | Server | The tags associated with the returned document
Expires             | Server | The time for how long the response remains valid
From                | Client | The client's e-mail address
Host                | Client | The DNS name of the document's server
If-Match            | Client | The tags the document should have
If-None-Match       | Client | The tags the document should not have
If-Modified-Since   | Client | Tells the server to return a document only if it has been modified since the specified time
If-Unmodified-Since | Client | Tells the server to return a document only if it has not been modified since the specified time
Last-Modified       | Server | The time the returned document was last modified
Location            | Server | A document reference to which the client should redirect its request
Referer             | Client | Refers to client's most recently requested document
Upgrade             | Both   | The application protocol the sender wants to switch to
Warning             | Both   | Information about the status of the data in the message

Table8.3: Some HTTP message headers.
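
The message layouts described above can be made concrete with a minimal sketch that builds a request message and takes a response message apart. The function names, the host name www.example.com, and the sample response are assumptions; the sketch handles only the simple case of one value per header.

```python
def build_request(operation, reference, headers, body=""):
    """Request line, then one 'Name: value' line per header, a blank line, the body."""
    lines = [f"{operation} {reference} HTTP/1.1"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    return "\r\n".join(lines) + "\r\n\r\n" + body


def parse_response(message):
    """Split a response into version, status code, phrase, headers, and body."""
    head, _, body = message.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    version, code, phrase = status_line.split(" ", 2)
    headers = dict(line.split(": ", 1) for line in header_lines)
    return version, int(code), phrase, headers, body


request = build_request("GET", "/index.html",
                        {"Host": "www.example.com", "Accept": "text/html"})
response = "HTTP/1.1 404 Not Found\r\nDate: Mon, 01 Jan 2024 00:00:00 GMT\r\n\r\n"
version, code, phrase, headers, body = parse_response(response)
```

Here `Host` and `Accept` are client headers and `Date` may be sent by both sides, in line with Table 8.3.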
3.2 Simple Object Access Protocol

e The Simple Object Access Protocol (SOAP) forms the standard for communication with Web 
services. 

e SOAP messages are largely based on XML. 

e A SOAP message generally consists of two parts, which are jointly put inside what is called a 
SOAP envelope. 

e The body contains the actual message, whereas the header is optional, containing 
information relevant for nodes along the path from sender to receiver. 

e Such nodes consist of the various processes in a multi-tiered implementation of a Web 
service. 

e Everything in the envelope is expressed in XML. 

e When a SOAP message is bound to HTTP, the recipient will be specified in the form of a URL,
whereas a binding to SMTP will specify the recipient in the form of an e-mail address.


<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">
  <env:Header>
    <n:alertcontrol xmlns:n="http://example.org/alertcontrol">
      <n:priority>1</n:priority>
      <n:expires>2001-06-22T14:00:00-05:00</n:expires>
    </n:alertcontrol>
  </env:Header>
  <env:Body>
    <m:alert xmlns:m="http://example.org/alert">
      <m:msg>Pick up Mary at school at 2pm</m:msg>
    </m:alert>
  </env:Body>
</env:Envelope>


Figure8.11: An example of an XML-based SOAP message. 
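
Because everything in the envelope is XML, a receiver can read the header and body with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on the message of Figure 8.11; the variable names are assumptions, and no SOAP library is involved.

```python
import xml.etree.ElementTree as ET

MESSAGE = """<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">
  <env:Header>
    <n:alertcontrol xmlns:n="http://example.org/alertcontrol">
      <n:priority>1</n:priority>
      <n:expires>2001-06-22T14:00:00-05:00</n:expires>
    </n:alertcontrol>
  </env:Header>
  <env:Body>
    <m:alert xmlns:m="http://example.org/alert">
      <m:msg>Pick up Mary at school at 2pm</m:msg>
    </m:alert>
  </env:Body>
</env:Envelope>"""

# Prefix-to-namespace map used when searching the parsed tree.
NS = {
    "env": "http://www.w3.org/2003/05/soap-envelope",
    "n": "http://example.org/alertcontrol",
    "m": "http://example.org/alert",
}

envelope = ET.fromstring(MESSAGE)
# The optional header carries information for nodes along the path.
priority = envelope.find("env:Header/n:alertcontrol/n:priority", NS).text
# The body carries the actual message.
msg = envelope.find("env:Body/m:alert/m:msg", NS).text
print(priority, msg)
```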


4. Naming 

e The Web uses a single naming system to refer to documents. 

e The names used are called Uniform Resource Identifiers or simply URIs. 

e A Uniform Resource Locator (URL) is a URI that identifies a document by including information 
on how and where to access the document. 

e In contrast, a Uniform Resource Name (URN) is used as a globally unique, location-independent,
and persistent reference to a document.

e How to access a document is generally reflected by the name of the scheme that is part of the
URL, such as http, ftp, or telnet.

e Where a document is located is embedded in a URL by means of the DNS name of the server
to which an access request can be sent, although an IP address can also be used.

e The number of the port on which the server will be listening for such requests is also part of
the URL.
e A URL also contains the name of the document to be looked up by that server, leading to the
general structures shown in Fig. 8.12.

      Scheme | Host name    | Port | Pathname
(a)   http   | www.cs.vu.nl |      | /home/steen/mbox
(b)   http   | www.cs.vu.nl | 80   | /home/steen/mbox
(c)   http   | 130.37.24.11 | 80   | /home/steen/mbox

Figure8.12: Often-used structures for URLs. (a) Using only a DNS name. (b) Combining a DNS name with a port
number. (c) Combining an IP address with a port number.


e Resolving a URL such as those shown in Fig. 8.12 is straightforward. 

e If the server is referred to by its DNS name, that name will need to be resolved to the server's 
IP address. 

e Using the port number contained in the URL, the client can then contact the server using the 
protocol named by the scheme, and pass it the document's name that forms the last part of 
the URL. 
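
The parts of the three URL forms in Fig. 8.12 can be extracted with Python's standard `urllib.parse.urlsplit`. The helper `url_parts` and the small table of default ports are assumptions added for the example; resolving the DNS name to an IP address is left out.

```python
from urllib.parse import urlsplit

# Port assumed when the URL names none (well-known ports for these schemes).
DEFAULT_PORTS = {"http": 80, "ftp": 21, "telnet": 23}


def url_parts(url):
    """Return (scheme, host, port, pathname) for a URL."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return parts.scheme, parts.hostname, port, parts.path


for url in ("http://www.cs.vu.nl/home/steen/mbox",
            "http://www.cs.vu.nl:80/home/steen/mbox",
            "http://130.37.24.11:80/home/steen/mbox"):
    print(url_parts(url))
```

Forms (a) and (b) yield the same four parts, since port 80 is the default for the http scheme.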


5. Synchronization 

e Synchronization has not been much of an issue for most traditional Web-based systems for
two reasons.

e First, the strict client-server organization of the Web, in which servers never exchange
information with other servers, means that there is nothing much to synchronize.

e Second, the Web can be considered as being a read-mostly system. 

e Distributed authoring of Web documents is handled through a separate protocol, namely 
WebDAV. 

e WebDAV stands for Web Distributed Authoring and Versioning and provides a simple means 
to lock a shared document, and to create, delete, copy, and move documents from remote 
Web servers. 

e To synchronize concurrent access to a shared document, WebDAV supports a simple locking 
mechanism. 

e There are two types of write locks. 

o An exclusive write lock can be assigned to a single client, and will prevent any other client
from modifying the shared document while it is locked.
o A shared write lock allows multiple clients to simultaneously update the document.

e Because locking takes place at the granularity of an entire document, shared write locks are
convenient when clients modify different parts of the same document.

e However, the clients themselves will need to take care that no write-write conflicts occur.
e Assigning a lock is done by passing a lock token to the requesting client.

e The server registers which client currently has the lock token. 

e Whenever the client wants to modify the document, it sends an HTTP post request to the 
server, along with the lock token. 

e The token shows that the client has write-access to the document, for which reason the 
server will carry out the request. 
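
The lock-token exchange for an exclusive write lock can be sketched as follows. The class `LockServer` and its methods are hypothetical and model only the bookkeeping described above; they do not reproduce the HTTP methods and headers that WebDAV actually uses on the wire.

```python
import secrets


class LockServer:
    """Hypothetical server granting exclusive write locks on documents."""

    def __init__(self):
        self.documents = {}
        self.holders = {}    # document -> (client, lock token)

    def lock(self, document, client):
        if document in self.holders:
            return None                      # already locked by some client
        token = secrets.token_hex(8)
        self.holders[document] = (client, token)
        return token

    def write(self, document, client, token, data):
        # The request is carried out only if it comes with the registered token.
        if self.holders.get(document) != (client, token):
            return False
        self.documents[document] = data
        return True

    def unlock(self, document, client, token):
        if self.holders.get(document) == (client, token):
            del self.holders[document]
            return True
        return False


server = LockServer()
token = server.lock("report.html", "alice")
refused = server.lock("report.html", "bob")      # None while alice holds the lock
accepted = server.write("report.html", "alice", token, "version 2")
```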


6. Web Proxy Caching 

e A Web proxy accepts requests from local clients and passes these to Web servers.

e When a response comes in, the result is passed to the client.

e The advantage of this approach is that the proxy can cache the result and return that result to 
another client, if necessary. 

e In other words, a Web proxy can implement a shared cache. 

e In addition to caching at browsers and proxies, it is also possible to place caches that cover a 
region, or even a country, thus leading to hierarchical caches. 

e Such schemes are mainly used to reduce network traffic, but have the disadvantage of 
incurring a higher latency compared to non-hierarchical schemes. 

e As an alternative to building hierarchical caches, one can also organize caches for cooperative 
deployment as shown in Fig. 8.13. 


[Diagram labels: HTTP Get request; 1. Look in local cache; 2. Ask neighboring proxy caches; 3. Forward request to Web server] 
Figure 8.13: The principle of cooperative caching. 

e In cooperative caching or distributed caching, whenever a cache miss occurs at a Web proxy, 
the proxy first checks a number of neighbouring proxies to see if one of them contains the 
requested document. 

e If such a check fails, the proxy forwards the request to the Web server responsible for the 
document. 

e This scheme is primarily deployed with Web caches belonging to the same organization or 
institution that are collocated in the same LAN. 
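
The three-step lookup of cooperative caching can be sketched as follows. This is a minimal sketch, assuming the origin Web server is represented by a plain dictionary and that neighbours expose their caches directly; the class name `Proxy` and its attributes are hypothetical.

```python
class Proxy:
    """Toy cooperative Web proxy; the origin server is modelled as a dict."""

    def __init__(self, name, origin):
        self.name = name
        self.origin = origin      # stand-in for the responsible Web server
        self.cache = {}
        self.neighbours = []

    def get(self, url):
        # 1. Look in the local cache.
        if url in self.cache:
            return self.cache[url], "local cache"
        # 2. Ask neighbouring proxy caches.
        for peer in self.neighbours:
            if url in peer.cache:
                self.cache[url] = peer.cache[url]
                return self.cache[url], "neighbour " + peer.name
        # 3. Forward the request to the Web server.
        self.cache[url] = self.origin[url]
        return self.cache[url], "origin server"
```

A second proxy asking for a document that the first one already fetched is served from its neighbour rather than from the origin server.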


7. Replication for Web Hosting Systems 
e There are essentially three different kinds of aspects related to replication in Web hosting 
systems: metric estimation, adaptation triggering, and taking appropriate measures. 
1. Metric Estimation 
o Latency metrics: measure the time taken for an action to complete, for example fetching 
a document. 
o Spatial metrics: mainly measure the distance between nodes in terms of the number of 
network-level routing hops, or hops between autonomous systems. 
o Network usage metrics: compute the consumed bandwidth in terms of the number of bytes 
to transfer. 
o Consistency metrics: tell us to what extent a replica deviates from its master copy. 
o Financial metrics: closely related to the actual infrastructure of the Internet. 


2. Adaptation Triggering 
o An important question that needs to be addressed is when and how adaptations are to be 
triggered. 
o A simple model is to periodically estimate metrics and subsequently take measures as 
needed. 
o Special processes located at the servers collect information and periodically check for 
changes. 


3. Adjustment Measures 
o There are essentially only three measures that can be taken to change the behaviour of a 
Web hosting service: 
= Changing the placement of replicas, 
= Changing consistency enforcement, 
= Deciding on how and when to redirect client requests. 


8. Replication of Web Applications 


[Diagram labels: Client; Edge-server side; Origin-server side; Server; response; Content-blind cache; Content-aware cache; Database copy; Authoritative database; Schema; full/partial data replication; full schema replication/query templates] 
Figure 8.14: Alternatives for caching and replication with Web applications 


e When considering improving performance of Web applications through caching and 
replication, matters are complicated. 
e Let us consider the edge-server situation as sketched in Fig. 8.14. 

e In this case, we assume a CDN in which each hosted site has an origin server that acts as the 
authoritative site for all read and update operations. 

e An edge server is used to handle client requests, and can store (part of) the information 
kept at the origin server. 

e Web clients request data through an edge server, which, in turn, gets its information from the 
origin server associated with the specific Web site referred to by the client. 

e As also shown in Fig. 8.14, we assume that the origin server consists of a database from which 
responses are dynamically created. 

e To improve performance, we can apply full replication of the data stored at the origin server. 

e This scheme works well whenever the update ratio is low and when queries require an 
extensive database search. 

e Replicating for performance will fail when the update ratio is high. 

e An alternative is partial replication, in which only a subset of the data is stored at the edge 
server. 

e The problem with partial replication is that it may be very difficult to manually decide which 
data is needed at the edge server. 

e Another option is caching, which can be done in two ways: 

e Content-aware cache: 

o With content-aware caches, an edge server maintains a database that is organized 
according to the structure of queries. 

o The queries are assumed to adhere to a limited number of templates, effectively meaning 
that the different kinds of queries that can be processed is restricted. 

o Whenever a query is received, the edge server matches the query against the available 
templates, and subsequently looks in its local database to compose a response, if possible. 

o If the requested data is not available, the query is forwarded to the origin server after 
which the response is cached before returning it to the client. 

o Thus, the data at the edge server needs to be kept consistent with the origin server. 

o The origin server needs to know which records are associated with which templates, so 
that any update of a record, or any update of a table, can be properly addressed by, for 
example, sending an invalidation message to the appropriate edge servers. 

e Content-blind cache: 

o When a client submits a query to an edge server, the server first computes a unique hash 
value for that query. 

o Using this hash value, it subsequently looks in its cache whether it has processed this 
query before. 

o If not, the query is forwarded to the origin and the result is cached before returning it to 
the client. 

o If the query had been processed before, the previously cached result is returned to the 
client. 

o The main advantage of this scheme is the reduced computational effort that is required 
from an edge server in comparison to the database approaches 
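
The content-blind scheme above amounts to a lookup table keyed by a hash of the query text. The following is a minimal sketch, assuming the origin server is reachable through a callable `origin_query` and that SHA-256 is used as the hash; both are illustrative choices, not something the text prescribes.

```python
import hashlib

class ContentBlindCache:
    """Edge-server cache keyed by a hash of the raw query string."""

    def __init__(self, origin_query):
        self.origin_query = origin_query   # callable: query text -> result
        self.results = {}                  # hash value -> cached result
        self.origin_calls = 0              # counts forwarded queries

    def query(self, text):
        # Compute a hash value for the query.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.results:
            # Not processed before: forward to the origin and cache the result.
            self.origin_calls += 1
            self.results[key] = self.origin_query(text)
        # Return the (possibly just cached) result to the client.
        return self.results[key]
```

Because the cache never interprets the query, two queries that differ only in spacing or letter case hash to different keys and are cached separately; this is the price paid for the low computational effort at the edge server.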


1. Introduction of Security in Distributed OS 

e Security in distributed systems can roughly be divided into two parts. 

e One part concerns the communication between users or processes, possibly residing on 
different machines. 

e The principal mechanism for ensuring secure communication is that of a secure channel. 

e Secure channels provide, more specifically, authentication, message integrity, and confidentiality. 

e The other part concerns authorization, which deals with ensuring that a process gets only those 
access rights to the resources in a distributed system it is entitled to. 


2. Overview of security techniques and features 
e Cryptography is defined as a means of protecting private information against unauthorized 
access in cases where physical security is difficult to achieve. 
e Cryptography is carried out using two basic operations: encryption and decryption. 
e The former is the process of transforming intelligible information (plaintext) into unintelligible 
form (cipher text). 
e Decryption is the reverse process, i.e. the process of transforming the information from cipher 
text to plaintext. 
e The plaintext message is encrypted into cipher text, transmitted, and then reconverted, as 
shown in Fig 9.1 


[Diagram labels: Alice; Bob; Eve (side information); Plaintext M; Ciphertext C; Encryption key; Decryption key] 
Figure 9.1: Basic Cryptographic System 
e The encryption algorithm has the following form: 
C = E(P, Ke) 
where P = plaintext to be encrypted, Ke = encryption key, C = resulting cipher text. 
e The decryption algorithm is the matching inverse function, which has the following form: 
P = D(C, Kd) 
where C = cipher text to be decrypted, Kd = decryption key, P = resulting plaintext. 
e To protect plaintext from revelation, it is possible to keep on varying the keys and get different 
cipher text. 
e There are two broad classes of cryptosystems based on whether the encryption and decryption 
keys are the same, viz. symmetric and asymmetric systems. 
1. Symmetric Cryptosystem 
o A symmetric cryptosystem uses the same key for both encryption and decryption. 
o The key must be easy to change when required and must always be kept secret. 
o This implies that the key is known only to authorized users. 
o Symmetric cryptosystems are also called shared key or private key cryptosystems, since 
both sender and receiver share the same key K: 
P = D_K(E_K(P)) 
o The key shared by both A and B is denoted as K_AB. 


[Diagram labels: Symmetric key generator; Encryption method; Encryption key, K; Decryption key, K; Plaintext, P; Ciphertext; Plaintext; Secure channels with confidentiality and authentication] 
Figure 9.2: Secret Key Distribution 
o These systems prove useful where both encryption and decryption are performed by a 
trusted system, as shown in Fig 9.2. 
o For example, a password-based authentication system uses this scheme for saving 
passwords in encrypted form. 
o During authentication, the OS uses the same key to decrypt the stored password and 
compares it with the password given by the user. 
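
The defining property of a symmetric cryptosystem, that decrypting with the same key recovers the plaintext, can be demonstrated with a toy cipher. The repeating-key XOR below is chosen only because it is short; it is not secure and is not the cipher used by any real system. The function names `E` and `D` mirror the notation of the text.

```python
from itertools import cycle

def E(plaintext: bytes, key: bytes) -> bytes:
    """Toy encryption: XOR each byte with the repeating key (NOT secure)."""
    return bytes(p ^ k for p, k in zip(plaintext, cycle(key)))

def D(ciphertext: bytes, key: bytes) -> bytes:
    """Decryption is the same XOR, so the same key undoes the encryption."""
    return E(ciphertext, key)

P = b"attack at dawn"
K = b"secret"
C = E(P, K)          # cipher text differs from the plaintext
recovered = D(C, K)  # same key gives back the plaintext
```

Decrypting with any other key yields bytes that differ from the original plaintext, which is why the shared key must stay secret.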
2. Asymmetric cryptography 
o In an asymmetric cryptosystem, the keys used for encryption and decryption are different, but 
they form a unique pair. 
o There are separate keys for encryption (Ke) and decryption (Kd): 
P = D_Kd(E_Ke(P)) 
o One key in the asymmetric cryptosystem is kept private while the other one is made public. 
o Hence these types of cryptosystems are referred to as public key systems. 
3. Hash Functions 
o Hash functions are one more cryptographic technique used in distributed systems. 
o A hash function H takes a message m of any length as input and produces a bit string h of 
fixed length: 
h = H(m) 
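
As a brief illustration of the fixed-length property, the sketch below uses SHA-256 from Python's standard `hashlib` module as one possible choice of H; the text itself does not name a specific hash function.

```python
import hashlib

def H(m: bytes) -> str:
    """Return the SHA-256 digest of m as 64 hexadecimal characters (256 bits)."""
    return hashlib.sha256(m).hexdigest()

short_digest = H(b"a")              # one-byte message
long_digest = H(b"a" * 1_000_000)   # one-megabyte message
```

Both digests have the same length even though the inputs differ in size by six orders of magnitude, and changing a single input byte produces a different digest.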


3. Access Control 
e When a secure channel is set up between a client and a server, the client can issue requests to 
the server. 
e A request can be carried out only if the client has sufficient access rights for that invocation. 
e Verifying access rights is called access control, while authorization is the process of granting access 
rights. 
3.1 General Issues 
e As shown in Fig 9.3, a general model is used to understand the issues in access control. It 
comprises subjects that issue a request to access an object. 

[Diagram labels: Request for operation; Authorized request] 
Figure 9.3: General model to control access to objects 

e The object may be an abstract entity like a process, file, database, semaphore, tree data 
structure, or a physical entity like CPU, memory segment, printer, tape drive, network site, etc. 

e Each object has a unique name that differentiates it from others in the system and it is 
referenced by this unique name. 

e The object is associated with a type to determine the type of operations that can be performed 
on it. 

e Subjects are processes that act on behalf of users or they could be objects that need services 
of other objects to carry out their tasks. 

e A subject, in other words, is an active entity whose access to objects needs to be controlled. 

e The entities that need to access and perform operations on objects and to which access 
authorizations are granted are called subjects. 

e Protection rules define the possible ways in which subjects and objects are allowed to interact. 

e Associated with a subject-object pair is an access right that defines the subset of possible 
operations for the object type that the subject can perform on that object. 

e The complete set of access rights defines which subjects can perform what operations on which 
objects. 

1. Protection Domains 
o A domain is an abstract definition of a set of access rights. 
o It is defined as a set of (object, rights) pairs. 
o Each pair specifies an object and one or more operations that can be performed on that 
object. 
o Each of these allowed operations is called a right. 
o Fig 9.4 shows an example of a system having three protection domains D1, D2 and D3. 
o Domains need not be disjoint; the same access right can exist in two or more domains 
simultaneously. 
o For example, the access right to Printer is in both domains D2 and D3. 
o Hence, a subject in either of these domains can perform the operations permitted on the 
Printer. 
o Also, the same object can exist in multiple domains with different access rights in each 
domain. 


[Diagram labels: Domain 1 (D1); Domain 2 (D2); Domain 3 (D3); File 1 (Read, Write, Execute); File 2 (Read, Write); File 2 (Read, Write, Execute); Semaphore 1 (up, down); File 3 (Read)] 
Figure 9.4: A system with three protection domains 

o For example, the rights on File 2 are different in domains D1 and D2: a subject in domain D1 
can perform read and write operations, while a subject in domain D2 can perform read, write, 
and execute operations. 
o Hence, to execute File 2, a subject must be in domain D2. 
o If an object exists only in a single domain, it can be accessed only by the subjects in that 
domain. For instance, File 3 can only be accessed by the subjects in domain D3. 


2. Access Matrix 


o While implementing a domain-based approach, the system needs to keep track of which 
rights on which objects belong to a particular domain. 
o In an access matrix model, this information is represented as an access matrix. 
o Each subject in an access matrix is represented by a row and each column represents an 
object. 
o Each entry in the matrix consists of a set of access rights. 
o The entry M[S,O] defines the set of operations that subject S is allowed to perform on 
object O. 
o The reference monitor checks the entry M[S,O] every time S requests an operation m on O. 
o If m is not listed, the invocation fails. 
o The protection state of the system is defined by the contents of the access matrix. 
o The access matrix shown below represents the protection state of the system. 


[Matrix contents not recoverable; surviving cell labels: Read, Write, Execute] 
Figure 9.5: Access matrix for the protection state of the system shown in Fig 9.4 
The objects are named as: 
Object1: O1 = File 1 
Object2: O2 = File 2 
Object3: O3 = File 3 
Object4: O4 = Semaphore {Up, Down} 
Object5: O5 = Printer 
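
The access matrix and the reference monitor check on M[S,O] can be sketched with nested dictionaries. The rights on File 1, File 2 and File 3 follow the text and Fig 9.4; the placement of Semaphore 1 in D3 and the name of the Printer right (`"print"`) are assumptions made for this sketch, since the figure contents did not survive.

```python
# Rows are domains, columns are objects; empty cells are simply absent.
ACCESS_MATRIX = {
    "D1": {"File1": {"read", "write", "execute"},
           "File2": {"read", "write"}},
    "D2": {"File2": {"read", "write", "execute"},
           "Printer": {"print"}},                 # right name assumed
    "D3": {"File3": {"read"},
           "Semaphore1": {"up", "down"},          # placement assumed
           "Printer": {"print"}},                 # right name assumed
}

def allowed(domain, operation, obj):
    """Reference monitor check: is `operation` listed in M[domain, obj]?"""
    return operation in ACCESS_MATRIX.get(domain, {}).get(obj, set())
```

For example, `allowed("D2", "execute", "File2")` holds while `allowed("D1", "execute", "File2")` does not, matching the statement that File 2 can be executed only from domain D2.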
o The following issues need to be resolved when an access matrix is representing the 
protection state of a system: 
o Deciding the contents of the access matrix 
= Policy decisions regarding which rights are to be included in a cell are system 
dependent. 
= The contents of the matrix entries for user-defined objects are decided by the users, but 
for system-defined objects they are defined by the system. 
= When a user creates a new object Oj, column j is added to the access matrix with suitable 
entries as decided by the user’s specification of access control for the object. 
o Validating access to objects by subjects 
= Access to objects should be permitted only in the manner specified in the protection state 
of the matrix. 
= This is done by using an object monitor associated with each type of object; each 
attempted access is validated as follows: 
1. A subject S in domain D initiates access r to object O, where r belongs to the set of 
operations that can be performed on O. 
2. The protection system forms a triplet (D, r, O) and passes it to the object monitor of O. 
3. The object monitor of O looks for the operation r in the (D, O)th entry of the access matrix; 
the access is permitted only if r is found there. 
o Allowing subjects to switch domains in a controlled manner 
= As per the principle of least privilege, a process can be allowed to switch domains, if 
needed, so that the process is given only necessary rights. 
= This should be carried out in a controlled manner, or else a process may switch to a 
powerful domain and violate the protection policies of the system. 
= Domain switching can be controlled by treating domains as objects and including them 
as objects in the access matrix. 
= Fig 9.6 shows the access matrix of our example with the three domains included as objects 
themselves. 


[Matrix contents not recoverable] 
Figure 9.6: Access matrix of Figure 9.5 with domains included as objects 


= In the modified access matrix, domain switching from Di to Dj is permitted if the right 
switch is present in the (Di, Dj)th entry of the access matrix. 

= Thus, as shown in Figure 9.6, a process executing in domain D1 can switch to domain D2 
or to domain D3, but a process executing in domain D2 can switch only to D3 and not 
to D1. 

= Also, a process executing in D3 cannot switch anywhere. 
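
The switch rights just described (D1 to D2 or D3, D2 to D3 only, D3 nowhere) can be sketched as one more column group of the access matrix. The table `SWITCH` and the class `Process` are hypothetical names used only for this illustration.

```python
# SWITCH[Di] is the set of domains Dj for which the (Di, Dj) entry holds "switch".
SWITCH = {
    "D1": {"D2", "D3"},
    "D2": {"D3"},
    "D3": set(),
}

class Process:
    """A process that may change domain only where a switch right exists."""

    def __init__(self, domain):
        self.domain = domain

    def switch_to(self, target):
        if target not in SWITCH[self.domain]:
            raise PermissionError(
                f"no switch right in entry ({self.domain}, {target})")
        self.domain = target
```

A process that starts in D1 can move to D2 and then to D3, but any attempt to move back towards D1 raises `PermissionError`, which is the controlled behaviour the text asks for.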
o Allowing changes in the protection state of the system in a controlled manner 

= To enable flexibility, the protection system may optionally allow the content of a 
domain to be changed. 

= If this facility is not provided, a new domain can be created with the changed 
contents, and the process can switch to the new domain when needed. 

= Since each entry in the access matrix can be modified, each of these can be treated as 
an object to be protected. 

= To allow controlled change, the rights defined for the new object are copy, owner, and 
control. 

= These changes can be categorized into (i) allowing changes to column entries and (ii) 
allowing changes to row entries. 

i). Column entries 


Figure 9.7: Access matrix with copy rights 

= The copy and owner rights allow a process to change the entries in a column. 

= The ability to copy an access right from one domain (row) to another is denoted by an 
asterisk (*) appended to the access right. 

= A process executing in domain D1 can copy the Read and Write operations on F1 to any 
other domain. 

= Similarly, a process executing in domain D2 can copy the Read operation on F1 and the Read 
and Execute operations on F2 to any other domain. Rights not marked with (*) cannot be 
copied to any other domain. 

= If the owner right is included in the (i,j)th entry of an access matrix, a process 
executing in domain Di can add and delete any right in any entry in column j. 

= Fig 9.8 shows the owner rights for the three objects from our earlier example. 

= The copy right can have the following variants: 

= Transfer: When a right is copied from one cell (i,j) to another cell (k,j), the entry is 
removed from the source cell. 

= Copy with propagation not allowed: When right R* is copied from one cell (i,j) to 
another cell (k,j), only the right R and not R* is created in the (k,j) entry. Thus, a process 
executing in the new domain Dk cannot copy the right R further. 

= Copy with propagation allowed: When right R* is copied from one cell (i,j) to another 
cell (k,j), the right R* is created in the (k,j) entry. Thus, a process executing in the 
new domain Dk can further copy the right R* or R. 
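
The three variants of the copy right can be sketched as one function over a dictionary-based matrix. The starred rights in `MATRIX` follow the D1 and D2 examples in the text; the function name `copy_right`, the mode names, and the choice that a transferred right keeps its copy flag at the destination are assumptions of this sketch.

```python
# A trailing "*" marks a right that may be copied within its column.
MATRIX = {
    "D1": {"F1": {"read*", "write*"}},
    "D2": {"F1": {"read*"}, "F2": {"read*", "execute*"}},
    "D3": {},
}

def copy_right(matrix, obj, src, dst, right, mode="limited"):
    """Copy `right` on `obj` from domain `src` to domain `dst`.

    mode "transfer":  the right moves; the source cell loses it
                      (assumed to arrive with its copy flag intact)
    mode "limited":   destination gets R without the copy flag
    mode "propagate": destination gets R*, so it can copy further
    """
    if mode not in ("transfer", "limited", "propagate"):
        raise ValueError(f"unknown mode: {mode}")
    starred = right + "*"
    if starred not in matrix[src].get(obj, set()):
        raise PermissionError(f"{src} holds no copyable {right} on {obj}")
    cell = matrix[dst].setdefault(obj, set())
    if mode == "transfer":
        matrix[src][obj].remove(starred)
        cell.add(starred)
    elif mode == "propagate":
        cell.add(starred)
    else:  # limited copy: propagation not allowed
        cell.add(right)
```

After a limited copy of Write on F1 from D1 to D3, domain D3 holds plain `write` and cannot pass it on, whereas a propagating copy leaves the starred right in the destination cell.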

[Matrix contents not recoverable; surviving cell labels: Execute*, Owner, Read-only, Owner] 
Figure 9.8: Access matrix with owner rights 

= The owner right is used to allow adding/deleting rights to column entries in a controlled 
manner. 

= If the owner right is included in a particular cell, a process executing in that domain can 
add or delete any right in any entry in that column. 

= Consider this example: D1, D2 and D3 have owner rights for objects F1, F2 and 
F3 respectively. 

= Hence, a process executing in domain D1 can add and delete any valid right of F1 in any 
entry in column F1. 

= Similarly, a process executing in domain D2 can add and delete any valid right of F2 in 
any entry in column F2, and a process executing in domain D3 can add and delete any 
valid right of F3 in any entry in column F3. 

= This is illustrated in Fig 9.8. 

ii). Row entries 


[Matrix contents not fully recoverable; surviving cell labels: Read, Write, Execute, Switch, Control] 
Figure 9.9: Access matrix with control rights 

= The control right, which is applicable to domain objects, is used to allow a process to 
change the entries in a row. 

= If the control right is present in the (Di, Dj)th entry of the access matrix, a process executing 
in domain Di can remove any access right from row Dj. 

= Fig 9.9 shows the access matrix with control rights included. 

= Since the entries (D1, D2) and (D1, D3) have control rights, a process executing in domain 
D1 can delete any right from rows D2 and D3. 

= Similarly, since the entry (D2, D3) has control rights, a process executing in domain D2 
can delete any right from row D3. 
o Implementation of access matrix 
= An access matrix is large and sparse. 
= Most domains have access to only a few of the objects, hence many entries are empty. 
= Direct implementation of the access matrix will therefore waste storage space. 
= The implementation can become complicated in distributed systems if the 
subjects and objects are located on different nodes. 
= Two methods are commonly used: access control lists (ACL) and capabilities. 
= Access control list methods decompose the access matrix by columns, whereas 
capability-based methods decompose the matrix by rows. 
o Access control lists 
= The access matrix is decomposed by columns and each column of the matrix is 
implemented as an access list for the object corresponding to the column. 
= The empty entries of the matrix are not stored in the access list. 
= For each object, a list of ordered pairs (domain, rights) is maintained. 
= This defines all domains with a non-empty set of access rights for that object. 
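
The two decompositions can be sketched on a small dictionary-based matrix. The function names `to_acls` and `to_capabilities` and the sample rights are illustrative assumptions; only the column-wise versus row-wise split comes from the text.

```python
def to_acls(matrix):
    """Decompose by columns: object -> list of (domain, rights) pairs."""
    acls = {}
    for domain, row in matrix.items():
        for obj, rights in row.items():
            if rights:                      # empty entries are not stored
                acls.setdefault(obj, []).append((domain, sorted(rights)))
    return acls

def to_capabilities(matrix):
    """Decompose by rows: domain -> list of (object, rights) pairs."""
    return {
        domain: [(obj, sorted(rights)) for obj, rights in row.items() if rights]
        for domain, row in matrix.items()
    }

sample = {
    "D1": {"F1": {"read", "write"}, "F2": set()},   # (D1, F2) is an empty cell
    "D2": {"F1": {"read"}, "F2": {"read", "execute"}},
}
acls = to_acls(sample)
caps = to_capabilities(sample)
```

Here the ACL of F2 contains only D2, because the empty (D1, F2) entry is dropped, which is exactly how both representations avoid storing the sparse parts of the matrix.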


3.2 Firewalls 


External access to any part of a distributed system is controlled with a special kind of reference 
monitor called a firewall. 

This monitor isolates the distributed system from the external world. 

All outgoing and incoming packets are routed through a special computer and inspected before 
they are passed. 

Thus, a firewall is a set of related programs located at a network gateway server that protects 
the resources of a private network from the users of other networks. 

The firewall itself should be thoroughly protected against any security threat and should never 
fail. 

Basically, a firewall works closely with a router program and examines each network packet to 
determine whether or not to forward it towards its destination. 

A firewall also includes or works with a proxy server that makes network requests on behalf of 
workstation users. 

A firewall is often installed in a specially designated computer separate from the rest of the 
network so that no incoming request can get directly at private network resources. 


[Diagram labels: Packet filtering router; Application gateway; Packet filtering router; Connections to internal networks; Connections to outside networks; Inside LAN; Outside LAN; Firewall] 
Figure 9.10: Firewall 

There are two categories of firewalls. 

As shown in Fig 9.10, the first is a packet filtering gateway that operates as a router. 

It makes the decision of whether or not to pass a network packet based on the source and 
destination addresses in the packet. 

The other type of firewall is the application-level gateway. 

This type of firewall inspects the contents of the incoming and outgoing packet. 

For example, a mail gateway can discard incoming and outgoing messages exceeding a certain 
size. 
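
Both firewall categories can be sketched in a few lines. The rule table, the address ranges, and the size limit below are invented for illustration and do not come from the text; the packet filter decides on source and destination addresses only, while the mail gateway inspects the content (here, just its size).

```python
import ipaddress

# (source network, destination network, action); the first match wins.
RULES = [
    ("10.0.0.0/8", "0.0.0.0/0", "forward"),     # inside hosts may go anywhere
    ("0.0.0.0/0", "10.0.0.25/32", "forward"),   # outside may reach the mail host
    ("0.0.0.0/0", "0.0.0.0/0", "drop"),         # default: drop everything else
]

def packet_filter(src, dst, rules=RULES):
    """Packet-filtering gateway: decide from the addresses in the packet."""
    s, d = ipaddress.ip_address(src), ipaddress.ip_address(dst)
    for src_net, dst_net, action in rules:
        if s in ipaddress.ip_network(src_net) and d in ipaddress.ip_network(dst_net):
            return action
    return "drop"

MAX_MAIL_BYTES = 1_000_000   # assumed size limit

def mail_gateway(message: bytes, limit=MAX_MAIL_BYTES):
    """Application-level gateway: discard messages exceeding a certain size."""
    return "discard" if len(message) > limit else "deliver"
```

With these rules a packet from an outside address to an arbitrary inside host is dropped, while the same sender can still reach the designated mail host.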


3.3 Secure Mobile Code 


An important requirement of distributed systems is the capability to migrate code between 
hosts, instead of just passive data. 

Mobile code leads to various security threats. 

While sending an agent across the network, there is a need to protect the code from malicious 
hosts, which may try to steal or modify the information carried by the agent. 

The hosts also need to be protected against malicious agents. 

Protecting an agent 

o Consider a situation where the mobile code is roaming in the system on behalf of the user 
and the agent is carrying an electronic credit card. 

o When the agent moves to a host, the host should not steal the credit card information nor 
should the agent modify the host’s data. 

o Other examples that require the protection of an agent against attacks from the host 
include destroying an agent, tampering with the agent, etc. 

o Tampering may lead to the agent stealing from its owner after it returns. 

Protecting the target 

o When we allow a mobile code to enter a host, it is equally important to protect the host 
from a malicious code. 

o A user has to allow an agent into the system and then the host should be able to exercise 
full control over what the agent can do. 

o There are two suggested approaches. 

o One is to create a sandbox and the other is to have a playground. 

o A sandbox is a technique by which the execution of the downloaded code can be fully 
controlled. Any attempt by the foreign code to execute a forbidden instruction halts the 
execution. 

o Sandbox 
= Fig 9.11 shows the sandbox organization. 
= Each Java program consists of a number of classes from which objects are created. 
= Each variable is declared as a part of a class. Program execution starts from a method 
called main. 
= A Java program can be executed only if the JVM is running on that machine. 

[Diagram labels: class; object; Java program; loader] 
Figure 9.11: Sandbox Organization 


= In a Java sandbox, the first step to ensure protection is ensuring that the components 
transferring the program to the client machine can be trusted. 

= These components are termed class loaders. 

= A class loader fetches a specified class from the server and installs it in the client’s 
address space so that the JVM can create objects from it. 

= A sandbox allows only trusted class loaders. 

= In other words, a Java program is not allowed to create its own class loaders. 

= The byte code verifier is the next check to ensure that the host is protected from the agent. 

= The verifier checks that the class contains no illegal instructions or instructions that 
could corrupt the stack or memory. 

= After checking that a class has been securely downloaded and verified, the JVM can 
instantiate objects from it and execute these objects' methods. 

= The security manager plays the role of a reference monitor. 

= A typical security manager checks I/O operations for validity and denies access to local 
files. 

o Playground 

= A playground is a separate, designated machine exclusively reserved for running mobile 
code. 


[Diagram labels: Only trusted code; Untrusted code; Playground; Local network] 
Figure 9.12: Playground 

= Resources local to the playground, such as files or network connections to external 
servers, are available to programs executing in the playground. 

= They are subject to normal protection mechanisms. 

= The resources local to other machines are physically disconnected and cannot be 
accessed by the downloaded code. 

= Users on these machines can access the playground through RPCs. 


4. Security Management 
e Security is implemented using symmetric and asymmetric keys. 
e There are three main issues: how to manage cryptographic keys, how to securely manage a 
group of servers such that a malicious process is not added to the group, and finally 
authorization management by capabilities and attribute certificates. 


4.1 Key Management 
e In the cryptosystems discussed so far, it was assumed that keys were readily available. 
e In the case of authentication, each party shares a key with the KDC. 
e The main problem is how to distribute the keys securely. 
e The second issue is how to revoke keys that are compromised or invalidated. 
Key establishment 
o First, let us consider how to set up a session key. 
o When Anant wants a secure channel of communication with Balwant, he uses Balwant’s 
public key to initiate communication. 
o If Balwant accepts, he generates a session key and sends it to Anant after encrypting it 
with Anant’s public key. 
o Similarly, a session key can be generated and distributed if both Anant and Balwant share a 
secret key. 
o It means that they both have the means to establish a secure channel. 
o A secure channel is necessary to share the secret key with the help of the KDC. 


Diffie-Hellman key exchange 
o It is a simple and popular scheme for sending a shared secret key across an insecure 
channel. 


[Diagram labels: Ann picks x; Baby picks y; Ann computes (g^y mod n)^x = g^(xy) mod n; Baby computes (g^x mod n)^y = g^(xy) mod n] 
Figure 9.13: Diffie-Hellman key exchange 
o The protocol works as follows: 
o Suppose Ann and Baby want to establish a secret key. 
= They agree on two large numbers n and g. Both can be made public. 
= Ann picks a large random number x and Baby picks a large random number y; both are kept secret. 
= Ann sends g^x mod n to Baby, together with n and g. 
= Baby calculates (g^x mod n)^y = g^(xy) mod n. 
= Baby knows g, n and y, so she sends g^y mod n to Ann. 
= Ann computes (g^y mod n)^x = g^(xy) mod n. 
o Thus, both Ann and Baby have a shared secret key g^(xy) mod n that is known only to them, 
even though it was established across a non-secure channel. 
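
The exchange can be checked numerically with Python's built-in three-argument `pow`. The values n = 23 and g = 5 are deliberately tiny toy parameters chosen for readability; real deployments use very large numbers. The helper names `dh_public` and `dh_shared` are hypothetical.

```python
import secrets

def dh_public(g, n, private):
    """Value sent over the insecure channel: g^private mod n."""
    return pow(g, private, n)

def dh_shared(other_public, private, n):
    """Shared key: (other side's public value)^private mod n."""
    return pow(other_public, private, n)

n, g = 23, 5                               # toy public parameters
x = secrets.randbelow(n - 3) + 2           # Ann's secret, in [2, n - 2]
y = secrets.randbelow(n - 3) + 2           # Baby's secret, in [2, n - 2]

ann_sends = dh_public(g, n, x)             # g^x mod n
baby_sends = dh_public(g, n, y)            # g^y mod n

ann_key = dh_shared(baby_sends, x, n)      # (g^y mod n)^x mod n
baby_key = dh_shared(ann_sends, y, n)      # (g^x mod n)^y mod n
```

Whatever secrets are drawn, `ann_key` and `baby_key` are equal, although only `ann_sends` and `baby_sends` ever crossed the channel.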
2. Issues in Key Distribution 
(a) Key distribution in symmetric cryptosystem 
o If two users on different nodes want to communicate using a symmetric cryptosystem, they 
must first share the encryption / decryption key that needs to be transmitted over the 
physical insecure medium. 
o Hence, the key must be encrypted before transmission. 
o Usually, a small number of keys are distributed earlier using a server process managed by a 
key distribution centre (KDC). This is a trusted entity and shared by all communicating users. 
o Also, there may be more than one KDC in the system. Fig 9.14 shows the principle of KDC. 


Figure9.14: Principle of KDC 
o The various implementation approaches are: centralized, fully distributed, and partially 
distributed approach. 
o Centralized approach 
= In this approach, a single KDC maintains a table of secret keys for each user. 
= A user A makes a request to the KDC (in plaintext with its userid) indicating that it wants 
a secure communication channel with user B. 
= The KDC extracts the key value corresponding to the userid, creates a secret key for 
secure communication between users A and B, and sends it to user A. 
= On receiving the message, user A decrypts the key after confirming that it matches 
the original request. 
= User A now uses this key and sends a message to user B, who decrypts it with its private 
key and retrieves the secret key. 
= Now both users A and B use the secret key for message transmission. 
= The centralized approach is simple and easy to implement. 
= The drawbacks are poor reliability and a performance bottleneck due to the single KDC. 
o Fully distributed approach 
= A KDC resides at each node in the distributed system and the secret keys are 
distributed well in advance. 
= Thus, all KDCs can communicate with each other: each KDC has a table of secret keys 
containing the private keys of all other KDCs. 
= For a system with n nodes, each KDC has n - 1 keys and a total of n(n - 1)/2 key pairs 
exist in the system. 
= To establish a secure logical communication channel, user A makes a request to the local 
KDC (plaintext form). 
= The KDC extracts the keys corresponding to the users, creates a secret key for secure 
communication, and sends it to users A and B. 
= On receiving the message, user A verifies that the message matches the original request 
and keeps the key for future use. 
= Similarly, user B decrypts the message with its private key and extracts the secret key. 
= Now user B initiates the authentication procedure, authenticates user A and then 
proceeds with secure communication. 
o Partially distributed approach
= Here the nodes are partitioned into regions and each region has a KDC.
= The prior distribution of secret keys allows each KDC to communicate securely with each
user of its own region and with the KDCs of other regions.
= Each KDC has a table of secret keys that contains the private keys of all users of its own
region and of all other KDCs.
= The distribution of a key for establishing a secure communication channel depends on
the location of the two users: whether they belong to the same region or to two
different regions.
= In the former case, the key distribution is similar to the centralized approach.
= If the users belong to different regions, the key distribution procedure is the same as in
the fully distributed approach.
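The centralized approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_cipher` is a stand-in XOR construction (not a secure cipher), and the names `CentralKDC`, `register` and `request_channel` are hypothetical. The second value returned by `request_channel` plays the role of the message that user A forwards to user B.

```python
import hashlib
import secrets


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256-derived keystream (applying it twice decrypts).

    Illustration only: this is NOT a secure cipher.
    """
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


class CentralKDC:
    """A single KDC holding one long-term secret key per user (hypothetical API)."""

    def __init__(self):
        self._user_keys = {}  # user ID -> long-term secret key

    def register(self, user_id: str) -> bytes:
        # Keys are distributed in advance; here registration simply returns the key.
        key = secrets.token_bytes(16)
        self._user_keys[user_id] = key
        return key

    def request_channel(self, requester: str, peer: str):
        # Create a fresh secret key for the pair and protect one copy per user.
        session_key = secrets.token_bytes(16)
        for_requester = toy_cipher(self._user_keys[requester], session_key)
        for_peer = toy_cipher(self._user_keys[peer], session_key)
        return for_requester, for_peer
```

Both users recover the same secret key by applying `toy_cipher` with their own long-term key. Because every request passes through one `CentralKDC` object, the sketch also makes the single point of failure visible.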
(b) Key Distribution in Asymmetric system 
o In an asymmetric cryptosystem, only public keys are distributed. They are not kept secret
and hence can be transmitted over an insecure communication channel.
o In an asymmetric cryptosystem, the key distribution procedure involves an authentication
procedure to prevent an intruder from generating a pair of keys.
o A common method is to use a public-key manager (PKM) that maintains a directory of
the public keys of all users in the system.
o The public key of the PKM is known to all users, while its secret key is known only to the PKM.
o Suppose user A wants to communicate with user B. It sends a request message (in cipher
text) to the PKM to establish a secure logical communication channel with user B.
o The PKM decrypts the message, extracts the public key corresponding to the user IDs, and sends
it (in cipher text form) to user A.
o On receiving the message, user A decrypts the message and confirms that the message is a
reply to its request.
o Next, it sends a message to user B, who decrypts the message and understands that it is a
request for communication.
o To authenticate the correct public key, user B sends a request to the PKM, gets the public
key, and authenticates user A.
o This allows regular communication to start.
o Authenticated distribution of public keys takes place through public-key certificates.
o These certificates consist of public keys and the identities with which the keys are associated.
o A certification authority signs the <public key, identity> pair and places the signature on the
certificate.
o The private key (K_CA-) of the certification authority is used to sign the certificate. Its
corresponding public key (K_CA+) is publicly known.
o If a client wishes to verify that the public key found in the certificate belongs to the
identified entity, then it uses the public key of the certification authority to verify the
certificate's signature.
o If the signature on the certificate matches the <public key, identity> pair, then the client accepts
that the public key belongs to that entity.
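The certificate check described above can be illustrated with textbook RSA. This is a minimal sketch under stated assumptions: the key is built from two small Mersenne primes and no padding is used, so it is far too weak for real use, and the function names `ca_sign` and `client_verify` are hypothetical. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
import hashlib

# Toy RSA key for the certification authority (assumption: illustration only).
P, Q = 2**61 - 1, 2**89 - 1          # two Mersenne primes
N = P * Q
E = 65537                            # public exponent, plays the role of K_CA+
D = pow(E, -1, (P - 1) * (Q - 1))    # private exponent, plays the role of K_CA-


def digest(public_key: str, identity: str) -> int:
    """Hash the <public key, identity> pair to an integer below N."""
    data = f"{identity}|{public_key}".encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % N


def ca_sign(public_key: str, identity: str) -> dict:
    """The CA signs the pair with its private key and issues a certificate."""
    signature = pow(digest(public_key, identity), D, N)
    return {"identity": identity, "public_key": public_key, "signature": signature}


def client_verify(cert: dict) -> bool:
    """A client checks the signature using only the CA's public key (E, N)."""
    expected = digest(cert["public_key"], cert["identity"])
    return pow(cert["signature"], E, N) == expected
```

A certificate whose identity or key has been altered after signing no longer matches the signature, so `client_verify` rejects it.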


(c) Secure Group Management 


o Many security systems make use of key distribution centres (KDCs) and certification
authorities (CAs).
o There are a few issues when these are deployed in distributed systems.
o First, they must be trusted.
o To ensure complete trust in security services, there is a need to provide high protection
against all kinds of security threats.
o Consider a situation where a CA is compromised: it will become difficult
to verify the validity of a public key, making the entire security system useless.
o Security services must also offer high availability.
o To set up a secure channel for communication between two processes, at least one of them
needs to contact the KDC for a shared secret key.
o If the KDC is not available, secure communication cannot be established.
o One of the solutions for high availability is replication.


(d) Authorization Management 


o A distributed system has resources spread across the system, and this makes access rights
management difficult.
o Access validation
= A capability is a data structure for a particular resource, specifying the access rights which
the holder of the capability has for that resource.
= There can be several capabilities for the same object, each conferring different access
rights to its holders.
= Possessing a capability implies that access is permitted in the modes associated with the
capability when executing an operation r on an object.
= Thus, a process executes the operation r, specifying the capability for object O as a
parameter.
= When the monitor for object O receives the access request, it verifies that the rights
information part of the capability has permission for operation r.
= Once a process establishes the possession of a capability for an object, it can access the
object in any of the modes allowed by the capability without any other check by the
security system.
o Granting and Passing Rights
= Each user maintains a list of capabilities that identifies all objects the user can access
and the associated access permissions.
= The capability-based security system uses one or more object managers for each object
type.
= All requests to create an object or to perform some operation on the object are sent to
one of the object managers of that object type.
= When a new object is created, the corresponding object manager generates a capability
with all rights of access for the object, and the generated capability is returned to the
owner for use.
= Subsequently, the owner gives the object's capability to other users with whom the
object is being shared.
= However, the owner can restrict the capability given to different users by removing some of
its rights.
= This is done by using a function in the object manager to restrict capabilities.
= On being called, the object manager generates a new capability with only the desired
access permissions and passes it to the caller, who then gives it to the users for whom
it was generated.
o Protecting capabilities against unauthorized access
o The following requirements need to be met to protect the objects against unauthorized
access:
= A capability must uniquely identify an object in the entire system.
= The capability must not be reused after the object is deleted, and the system must
trigger an error when an obsolete capability is used.
= Capabilities need to be protected from user tampering.
= Hence capabilities must be treated as unforgeable protected objects maintained by
the OS and only indirectly accessed by the users.
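A capability scheme along these lines can be sketched as follows. This is one possible design, not the one the text prescribes: instead of keeping capabilities inside the OS, the hypothetical `ObjectManager` below makes them unforgeable by attaching an HMAC tag computed with a secret only the manager knows. All class and method names are assumptions made for the example.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    object_id: str
    rights: frozenset
    tag: bytes  # HMAC over (object_id, rights); only the manager can compute it


class ObjectManager:
    def __init__(self):
        self._secret = secrets.token_bytes(32)
        self._live = set()  # IDs of objects that still exist

    def _tag(self, object_id, rights):
        message = (object_id + "|" + ",".join(sorted(rights))).encode()
        return hmac.new(self._secret, message, hashlib.sha256).digest()

    def create_object(self, all_rights):
        # Random IDs are practically unique, so a deleted object's ID is not reused.
        object_id = secrets.token_hex(8)
        self._live.add(object_id)
        rights = frozenset(all_rights)
        return Capability(object_id, rights, self._tag(object_id, rights))

    def is_valid(self, cap):
        expected = self._tag(cap.object_id, cap.rights)
        return cap.object_id in self._live and hmac.compare_digest(cap.tag, expected)

    def restrict(self, cap, keep):
        # Generate a new capability carrying only a subset of the rights.
        if not self.is_valid(cap):
            raise PermissionError("invalid or obsolete capability")
        rights = cap.rights & frozenset(keep)
        return Capability(cap.object_id, rights, self._tag(cap.object_id, rights))

    def check(self, cap, operation):
        # Access validation: the capability must be genuine and permit the operation.
        return self.is_valid(cap) and operation in cap.rights

    def delete_object(self, cap):
        if not self.is_valid(cap):
            raise PermissionError("invalid or obsolete capability")
        self._live.discard(cap.object_id)
```

An owner obtains a full capability from `create_object`, hands out weaker ones made with `restrict`, and every capability becomes obsolete once `delete_object` has run.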
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1. Java RMI 


e Java hides the difference between local and remote method invocation from the user.
e An object of any type can be passed as a parameter to an RMI in Java, provided the type can be marshalled.
[Figure: machines A, B and C. Client code on machine A, with a local reference L1 and a remote reference R1, performs an RMI to the server at C with L1 and R1 as parameters; machine C runs the server code with the method implementation and receives a copy of O1.]
Figure 10.1: Java remote object


e Fig 10.1 shows how Java RMI is implemented.
e The reference to a remote object consists of the network address and endpoint of the server.
e It also contains the local ID of the actual object in the server's address space.
e Each object in Java is an instance of a class, which contains the implementation of one or more
interfaces.
e A Java remote object is built from two classes: (i) a server class and (ii) a client class.
e Server class:
o This class contains the implementation of the server-side code, i.e. the objects which run on the
server.
o It also consists of a description of the object state and the implementation of the methods which
operate on that state.
o The server-side stub (the skeleton) is generated from the interface specification of the
object.
e Client class:
o This class contains the implementation of the client-side code and the proxy.
o It is generated from the object interface specification.
o The proxy converts each method call into a message that is sent to the server-side
implementation of the remote object.
e A connection to the server is set up for a call and torn down when the call is complete.
e The proxy state contains the server's network address, the endpoint at the server, and the local ID of the
object.
e Hence, a proxy contains sufficient information to invoke the methods of a remote object.

Figure 10.2: Java RMI layer
e Proxies are serializable in Java: a proxy can be marshalled and sent as a series of bytes to another process, where
it is unmarshalled and can be used to invoke methods on the remote object.
e The Java RMI layer is depicted in Fig 10.2.
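The proxy and skeleton roles can be illustrated with a small Python analogue. It is not the Java RMI API: `Proxy`, `Skeleton` and `Counter` are hypothetical names, `pickle` stands in for Java serialization, and a direct function call stands in for the network connection.

```python
import pickle


class Skeleton:
    """Server side: unmarshals a request and calls the real object."""

    def __init__(self):
        self._objects = {}  # local ID -> object in the server's address space

    def export(self, local_id, obj):
        self._objects[local_id] = obj

    def handle(self, request: bytes) -> bytes:
        # Note: real systems must never unpickle data from untrusted peers.
        local_id, method, args = pickle.loads(request)
        result = getattr(self._objects[local_id], method)(*args)
        return pickle.dumps(result)


class Proxy:
    """Client side: turns each method call into a message for the skeleton."""

    def __init__(self, send, local_id):
        self._send = send          # stands in for network address and endpoint
        self._local_id = local_id  # identifies the object at the server

    def __getattr__(self, method):
        if method.startswith("_"):
            raise AttributeError(method)

        def invoke(*args):
            request = pickle.dumps((self._local_id, method, args))
            return pickle.loads(self._send(request))

        return invoke


class Counter:
    """Example remote object: state plus a method operating on that state."""

    def __init__(self):
        self.total = 0

    def add(self, amount):
        self.total += amount
        return self.total
```

After `skeleton.export("counter-1", Counter())`, a client holding `Proxy(skeleton.handle, "counter-1")` can call `proxy.add(2)` as if the counter were local, while the state lives only on the server side.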


2. SUN Network File System

[Figure: a client and a server, each with a virtual file system layer and a local file system interface, connected by the network.]
Figure 10.3: Architecture of NFS

e The architecture of NFS is shown in Figure 10.3.

e The basic idea behind NFS is that each file server provides a standardized view of its local file
system.

e A client accesses the file system using the system calls provided by its local operating system.

e However, the local UNIX file system interface is replaced by an interface to the Virtual File 
System (VFS). 

e With NFS, operations on the VFS interface are either passed to a local file system, or passed to 
a separate component known as the NFS client, which takes care of handling access to files 
stored at a remote server. 

e In NFS, all client-server communication is done through RPCs. 

e The NFS client implements the NFS file system operations as RPCs to the server. 

e On the server side the NFS server is responsible for handling incoming client requests. 

e The RPC stub unmarshals requests and the NFS server converts them to regular VFS file
operations that are subsequently passed to the VFS layer.

e Again, the VFS is responsible for implementing a local file system in which the actual files are
stored.

e The file system model offered by NFS is almost the same as the one offered by UNIX-based 
systems. 

e Files are treated as uninterpreted sequences of bytes. 

e They are hierarchically organized into a naming graph in which nodes represent directories and 
files. 

e NFS also supports hard links as well as symbolic links, like any UNIX file system. 

e To access a file, a client must first look up its name in a naming service and obtain the associated
file handle.

e Each file has a number of attributes whose values can be looked up and changed. 


2.1 RPCs in NFS
e Every NFS operation can be implemented as a single remote procedure call to a file server.
e In NFS version 3, in order to read data from a file for the first time, a client normally first has to look up the file
handle using the lookup operation, after which it can issue a read request, as shown in Fig.
10.4(a).
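[Figure: (a) the client sends LOOKUP and the server looks up the name; the client then sends READ and the server reads the file data. (b) The client sends a single request containing LOOKUP, OPEN and READ; the server looks up the name, opens the file and reads the file data. Time runs downward.]

Figure 10.4: (a) Reading data from a file in NFS version 3. (b) Reading data using a compound procedure in version 4.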

e This approach requires two successive RPCs.

e The drawback becomes apparent when considering the use of NFS in a wide-area system, where the
extra latency of a second RPC degrades performance.

e To overcome such problems, NFSv4 supports compound procedures by which several RPCs can
be grouped into a single request, as shown in Fig. 10.4(b).

e In the case of version 4, it is also necessary to open the file before reading can take place.

e After the file handle has been looked up, it is passed to the open operation, after which the
server continues with the read operation.

e The overall effect in this example is that only two messages need to be exchanged between the
client and server: one request and one reply.
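The difference between the two styles can be made concrete with a toy server that counts the requests it receives. This is a sketch under assumptions, not the NFS wire protocol: `ToyNfsServer`, its method names and the "current file handle" passed from one operation to the next are simplifications introduced for the example.

```python
class ToyNfsServer:
    def __init__(self, files):
        self.files = files          # file name -> contents (bytes)
        self.handles = {}           # file handle -> file name
        self.open_handles = set()
        self.requests_seen = 0      # number of request messages received

    # Individual operations.
    def lookup(self, name):
        handle = "fh-" + name
        self.handles[handle] = name
        return handle

    def open(self, handle):
        self.open_handles.add(handle)
        return handle

    def read(self, handle, offset, count):
        data = self.files[self.handles[handle]]
        return data[offset:offset + count]

    # Version 3 style: one request per operation.
    def rpc(self, operation, *args):
        self.requests_seen += 1
        return getattr(self, operation)(*args)

    # Version 4 style: several operations grouped into a single request.
    def compound(self, operations):
        self.requests_seen += 1
        current = None              # file handle passed from one operation to the next
        result = None
        for name, args in operations:
            if name == "lookup":
                current = result = self.lookup(*args)
            elif name == "open":
                current = result = self.open(current)
            elif name == "read":
                result = self.read(current, *args)
            else:
                raise ValueError("unknown operation: " + name)
        return result
```

With `rpc`, a lookup followed by a read costs two requests and two replies. `compound([("lookup", ("notes.txt",)), ("open", ()), ("read", (0, 5))])` does the lookup, open and read in a single request and reply.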
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2.2 Naming in NFS 


e The fundamental idea underlying the NFS naming model is to provide clients completely
transparent access to a remote file system as maintained by a server.
[Figure: client A and client B each mount a directory exported by the server into their own local file system, over the network.]
Figure 10.5: Mounting (part of) a remote file system in NFS.
e This transparency is achieved by letting a client mount a remote file system into its
own local file system, as shown in Fig. 10.5.
e Instead of mounting an entire file system, NFS allows clients to mount only part of a file system,
as also shown in Fig. 10.5.
e A server is said to export a directory when it makes that directory and its entries available to
clients.
e An exported directory can be mounted into a client's local name space.
e NFS also supports mounting of nested directories, as shown in Fig 10.6.


[Figure: the client imports a directory from server A; server A imports a directory from server B, so its exported directory contains an imported subdirectory; the client needs to explicitly import that subdirectory from server B.]
Figure 10.6: Mounting nested directories from multiple servers in NFS.

Automounting

o On-demand mounting of a remote file system (or actually an exported directory) is handled
in NFS by an automounter, which runs as a separate process on the client's machine.
o The principle underlying an automounter is relatively simple.
o Consider a simple automounter implemented as a user-level NFS server on a UNIX operating
system.
o Assume that the home directories of all users are available through the local
directory /home.
o When a client machine boots, the automounter starts by mounting this directory.
o The effect of this local mount is that whenever a program attempts to access /home, the
UNIX kernel will forward a lookup operation to the NFS client, which in this case will forward
the request to the automounter in its role as NFS server, as shown in Fig. 10.7.
[Figure: client machine and server machine. 1. Lookup "/home/alice"; 3. Mount request; 4. Mount subdir "alice" from server.]
Figure 10.7: A simple automounter for NFS.
o For example, suppose that Alice logs in. The login program will attempt to read the directory
/home/alice to find information such as login scripts.
o The automounter will thus receive the request to look up subdirectory /home/alice, for
which reason it first creates a subdirectory alice in /home.
o It then looks up the NFS server that exports Alice's home directory and subsequently mounts
that directory in /home/alice.
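The on-demand behaviour can be sketched with plain dictionaries. This is only a model of the control flow: `ToyAutomounter`, the `export_map` table and the server names are hypothetical, and no real mount system call is made.

```python
class ToyAutomounter:
    def __init__(self, export_map):
        # user name -> (server, exported directory); assumed to be configured in advance
        self.export_map = export_map
        self.local_dirs = set()   # subdirectories created under /home
        self.mounts = {}          # local path -> (server, exported directory)

    def lookup(self, path):
        """Handle a lookup forwarded by the NFS client for a path under /home."""
        prefix = "/home/"
        if not path.startswith(prefix):
            raise FileNotFoundError(path)
        user = path[len(prefix):].split("/")[0]
        local_path = prefix + user
        if local_path not in self.mounts:
            if user not in self.export_map:
                raise FileNotFoundError(path)
            self.local_dirs.add(local_path)                  # create the subdirectory
            self.mounts[local_path] = self.export_map[user]  # mount the exported directory
        return self.mounts[local_path]
```

Nothing is mounted at start-up. The first lookup of a path below /home/alice creates the mount, and later lookups reuse it.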


2.3 File Locking 
e File locking in NFSv4 is simple. There are essentially only four operations related to locking, as 
shown in Table 10.1. 
e NFSv4 distinguishes read locks from write locks. 
e Multiple clients can simultaneously access the same part of a file provided they only read data. 
e A write lock is needed to obtain exclusive access to modify part of a file.
Table 10.1: NFSv4 operations related to file locking.

e Operation lock is used to request a read or write lock on a consecutive range of bytes in a file.

e It is a nonblocking operation: if the lock cannot be granted due to another conflicting lock, the
client gets back an error message and has to poll the server at a later time.

e Alternatively, the client can request to be put on a FIFO-ordered list maintained by the server.

e As soon as the conflicting lock has been removed, the server will grant the next lock to the client
at the top of the list, provided it polls the server before a certain time expires.

e This approach prevents the server from having to notify clients, while still being fair to clients
whose lock request could not be granted, because grants are made in FIFO order.
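A nonblocking byte-range lock with a FIFO waiting list might look like the sketch below. It is a simplification under stated assumptions: `ToyLockServer` is a hypothetical name, byte ranges are half-open `[start, end)` intervals, and the time limit for polling is left out.

```python
from collections import deque


def conflicts(a, b):
    """Two requests conflict if they come from different clients, their byte
    ranges overlap, and at least one of them is a write lock."""
    (client_a, start_a, end_a, mode_a), (client_b, start_b, end_b, mode_b) = a, b
    overlap = start_a < end_b and start_b < end_a
    return client_a != client_b and overlap and "write" in (mode_a, mode_b)


class ToyLockServer:
    def __init__(self):
        self.granted = []        # (client, start, end, mode)
        self.waiting = deque()   # FIFO list of denied requests that asked to queue

    def lock(self, client, start, end, mode, queue=False):
        """Nonblocking: returns True if granted, False (an error) otherwise."""
        request = (client, start, end, mode)
        earlier = []
        for waiter in self.waiting:
            if waiter == request:
                break
            earlier.append(waiter)   # requests that are ahead in FIFO order
        if any(conflicts(request, other) for other in self.granted + earlier):
            if queue and request not in self.waiting:
                self.waiting.append(request)
            return False             # the client has to poll again later
        if request in self.waiting:
            self.waiting.remove(request)
        self.granted.append(request)
        return True

    def unlock(self, client, start, end):
        self.granted = [g for g in self.granted if g[:3] != (client, start, end)]
```

The server never notifies anyone: a queued client simply calls `lock` again, and it succeeds only once no granted lock and no earlier waiter conflicts with it.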


2.4 Caching 

e Caching in NFSv3 has been mainly left outside of the protocol. 

e This approach has led to the implementation of different caching policies, most of which never 
guaranteed consistency. 

e At best, cached data could be stale for a few seconds compared to the data stored at a server. 

e NFSv4 solves some of these consistency problems, but essentially still leaves cache consistency 
to be handled in an implementation-dependent way. 

e The general caching model that is assumed by NFS is shown in Fig. 10.8. 

e Each client can have a memory cache that contains data previously read from the server. 

e In addition, there may also be a disk cache that is added as an extension to the memory cache,
using the same consistency parameters.
[Figure: client application with its caches, connected to the server over the network.]
Figure 10.8: Client-side caching in NFS.
e The simplest approach is when a client opens a file and caches the data it obtains from the
server as the result of various read operations.
e In addition, write operations can be carried out in the cache as well.
e When the client closes the file, NFS requires that if modifications have taken place, then the cached
data must be flushed back to the server.

e This approach corresponds to implementing session semantics as discussed earlier. 

e Once (part of) a file has been cached, a client can keep its data in the cache even after closing 
the file. 

e Also, several clients on the same machine can share a single cache. 

e NFS requires that whenever a client opens a previously closed file that has been (partly) cached, 
the client must immediately revalidate the cached data. 

e Revalidation takes place by checking when the file was last modified and invalidating the cache 
in case it contains stale data. 
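The flush-on-close and revalidate-on-open rules can be sketched as follows. This is a whole-file model under assumptions: `ToyServer` and `CachingClient` are hypothetical names, and a logical counter stands in for the last-modified time.

```python
class ToyServer:
    def __init__(self):
        self.data = {}    # file name -> contents
        self.mtime = {}   # file name -> logical modification time
        self.clock = 0

    def write(self, name, content):
        self.clock += 1
        self.data[name] = content
        self.mtime[name] = self.clock

    def get_mtime(self, name):
        return self.mtime[name]

    def read(self, name):
        return self.data[name]


class CachingClient:
    def __init__(self, server):
        self.server = server
        self.cache = {}     # file name -> (mtime seen at the server, contents)
        self.dirty = set()  # files modified in the cache but not yet flushed

    def open(self, name):
        # Revalidate: drop cached data if the server copy has changed since.
        cached = self.cache.get(name)
        if cached is not None and cached[0] != self.server.get_mtime(name):
            del self.cache[name]

    def read(self, name):
        if name not in self.cache:
            self.cache[name] = (self.server.get_mtime(name), self.server.read(name))
        return self.cache[name][1]

    def write(self, name, content):
        # Writes are carried out in the cache only.
        seen = self.cache[name][0] if name in self.cache else None
        self.cache[name] = (seen, content)
        self.dirty.add(name)

    def close(self, name):
        # Flush modifications back to the server when the file is closed.
        if name in self.dirty:
            self.server.write(name, self.cache[name][1])
            self.cache[name] = (self.server.get_mtime(name), self.cache[name][1])
            self.dirty.discard(name)
```

A client that keeps reading without reopening the file continues to see its cached copy. Only the next `open` detects that another client has flushed a newer version.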


3. Google File System 
3.1 Introduction 
e The Google File System is a scalable distributed file system for large distributed data-intensive 
applications. 
e It provides fault tolerance while running on inexpensive commodity hardware, and it delivers
high aggregate performance to a large number of clients.


3.2 Assumptions/Features 

e The system is built from many inexpensive commodity components that often fail. 

e It must constantly monitor itself and detect, tolerate, and recover promptly from component failures
on a routine basis.

e The system stores a modest number of large files.

e We expect a few million files, each typically 100 MB or larger in size.

e The workloads primarily consist of two kinds of reads: large streaming reads and small random
reads.
o In large streaming reads, individual operations typically read hundreds of KBs, more
commonly 1 MB or more.
o A small random read typically reads a few KBs at some arbitrary offset.

e The workloads also have many large, sequential writes that append data to files. 

e The system must efficiently implement well-defined semantics for multiple clients that 
concurrently append to the same file. 

e Atomicity with minimal synchronization overhead is essential. 

e High sustained bandwidth is more important than low latency. 


3.3 Architecture 


e A GFS cluster consists of a single master and multiple chunk servers and is accessed by multiple
clients, as shown in Figure 10.9.
Figure 10.9: GFS Architecture

e Each of these is typically a commodity Linux machine running a user-level server process.

e Files are divided into fixed-size chunks. 

e Each chunk is identified by an immutable and globally unique 64-bit chunk handle assigned by
the master at the time of chunk creation.

e Chunk servers store chunks on local disks as Linux files and read or write chunk data specified 
by a chunk handle and byte range. 

e For reliability, each chunk is replicated on multiple chunk servers. 

e By default, we store three replicas, though users can designate different replication levels for 
different regions of the file namespace. 

e The master maintains all file system metadata. 

e This includes the namespace, access control information, the mapping from files to chunks, and 
the current locations of chunks. 

e It also controls system-wide activities such as chunk lease management, garbage collection of 
orphaned chunks, and chunk migration between chunk servers. 

e The master periodically communicates with each chunk server in HeartBeat messages to give 
it instructions and collect its state. 

e Clients interact with the master for metadata operations, but all data-bearing communication 
goes directly to the chunk servers. 

e The interactions for a simple read, with reference to Figure 10.9, are as follows.
o First, using the fixed chunk size, the client translates the file name and byte offset specified
by the application into a chunk index within the file.
o Then, it sends the master a request containing the file name and chunk index.
o The master replies with the corresponding chunk handle and locations of the replicas.
o The client caches this information using the file name and chunk index as the key.
o The client then sends a request to one of the replicas, most likely the closest one.
o The request specifies the chunk handle and a byte range within that chunk.
o Further reads of the same chunk require no more client-master interaction until the cached
information expires or the file is reopened.
o In fact, the client typically asks for multiple chunks in the same request and the master can
also include the information for chunks immediately following those requested.
o This extra information sidesteps several future client-master interactions at practically no
extra cost.
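The offset-to-chunk translation and the client-side cache from the steps above can be sketched as below. The 64 MB chunk size is the value these notes give for GFS. The names `locate`, `ToyMaster` and `ToyClient` are hypothetical, and cache expiry is left out.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed chunk size of 64 MB


def locate(offset, length):
    """Split a byte range of a file into (chunk index, offset in chunk, length) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        index = offset // CHUNK_SIZE
        within = offset % CHUNK_SIZE
        count = min(end - offset, CHUNK_SIZE - within)
        pieces.append((index, within, count))
        offset += count
    return pieces


class ToyMaster:
    def __init__(self, table):
        # (file name, chunk index) -> (chunk handle, replica locations)
        self.table = table
        self.lookups = 0

    def find(self, name, index):
        self.lookups += 1
        return self.table[(name, index)]


class ToyClient:
    def __init__(self, master):
        self.master = master
        self.cache = {}  # keyed by (file name, chunk index), as in the text

    def chunk_info(self, name, offset):
        key = (name, offset // CHUNK_SIZE)
        if key not in self.cache:
            self.cache[key] = self.master.find(*key)
        return self.cache[key]
```

Repeated reads inside the same chunk hit the client cache, so the master sees one lookup per chunk rather than one per read.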


3.4 Chunk Size 
e Chunk size is one of the key design parameters. We have chosen 64 MB, which is much larger 
than typical file system block sizes. 
e A large chunk size offers several important advantages. 

o It reduces clients’ need to interact with the master because reads and writes on the same 
chunk require only one initial request to the master for chunk location information. 

o Since on a large chunk a client is more likely to perform many operations,
it can reduce network overhead by keeping a persistent TCP connection to the chunk server
over an extended period of time.

o It reduces the size of the metadata stored on the master.

e A large chunk size also has disadvantages.

o A small file consists of a small number of chunks, perhaps just one.

o The chunk servers storing those chunks may become hot spots if many clients are accessing
the same file.


3.5 Metadata 

e The master stores three major types of metadata: the file and chunk namespaces, the mapping 
from files to chunks, and the locations of each chunk’s replicas. 

e All metadata is kept in the master’s memory. 

e The first two types (namespaces and file-to-chunk mapping) are also kept persistent by logging 
mutations to an operation log stored on the master’s local disk and replicated on remote 
machines. 

e The master does not store chunk location information persistently. 

e Instead, it asks each chunk server about its chunks at master startup and whenever a chunk 
server joins the cluster. 


3.6 Leases and Mutation Order 

e A mutation is an operation that changes the contents or metadata of a chunk, such as a write or
an append operation.

e Each mutation is performed at all the chunk’s replicas. 

e We use leases to maintain a consistent mutation order across replicas. 

e The master grants a chunk lease to one of the replicas, which we call the primary. 

e The primary picks a serial order for all mutations to the chunk. 

e All replicas follow this order when applying mutations. 

e Thus, the global mutation order is defined first by the lease grant order chosen by the master, 
and within a lease by the serial numbers assigned by the primary. 

e In Fig 10.10, we illustrate this process by following the control flow of a write through these
numbered steps.
Figure 10.10: Write Control and Data Flow
1. The client asks the master which chunk server holds the current lease for the chunk and the
locations of the other replicas.
= If no one has a lease, the master grants one to a replica it chooses.
2. The master replies with the identity of the primary and the locations of the other
(secondary) replicas.
= The client caches this data for future mutations.
= It needs to contact the master again only when the primary becomes unreachable or
replies that it no longer holds a lease.
3. The client pushes the data to all the replicas. A client can do so in any order.
= Each chunk server will store the data in an internal LRU buffer cache until the data is
used or aged out.
4. Once all the replicas have acknowledged receiving the data, the client sends a write
request to the primary.
= The request identifies the data pushed earlier to all of the replicas.
= The primary assigns consecutive serial numbers to all the mutations it receives, possibly
from multiple clients, which provides the necessary serialization.
= It applies the mutation to its own local state in serial number order.
5. The primary forwards the write request to all secondary replicas.
= Each secondary replica applies mutations in the same serial number order assigned by
the primary.
6. The secondaries all reply to the primary indicating that they have completed the
operation.
7. The primary replies to the client.
= Any errors encountered at any of the replicas are reported to the client.
= In case of errors, the write may have succeeded at the primary and an arbitrary subset
of the secondary replicas.
= The client request is considered to have failed, and the modified region is left in an
inconsistent state.
= Our client code handles such errors by retrying the failed mutation.
= It will make a few attempts at steps (3) through (7) before falling back to a retry from
the beginning of the write.
e If a write by the application is large or straddles a chunk boundary, GFS client code breaks it
down into multiple write operations.
e They all follow the control flow described above but may be interleaved with and overwritten
by concurrent operations from other clients.
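The role of the primary in steps 3 to 7 can be sketched as follows. This is a simplified model, not GFS code: `Replica` and `Primary` are hypothetical names, every mutation is treated as an append, forwarding is synchronous, and error handling and retries are left out.

```python
class Replica:
    def __init__(self):
        self.buffer = {}          # data ID -> bytes pushed by clients (step 3)
        self.chunk = bytearray()  # contents of the chunk on this replica
        self.applied = []         # serial numbers in the order they were applied

    def push(self, data_id, data):
        self.buffer[data_id] = data

    def apply(self, serial, data_id):
        self.chunk += self.buffer.pop(data_id)
        self.applied.append(serial)


class Primary(Replica):
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id):
        # Step 4: assign the next serial number and apply the mutation locally.
        serial = self.next_serial
        self.next_serial += 1
        self.apply(serial, data_id)
        # Steps 5 and 6: forward to the secondaries, which apply in the same order.
        for secondary in self.secondaries:
            secondary.apply(serial, data_id)
        # Step 7: reply to the client.
        return serial
```

Clients may push their data to the replicas in any order. The order in which write requests reach the primary alone decides the order in which every replica applies them.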


3.7 Creation, Re-replication, Rebalancing 


e Chunk replicas are created for three reasons: chunk creation, re-replication, and rebalancing.
e When the master creates a chunk, it chooses where to place the initially empty replicas. It
considers several factors.
1. Place new replicas on chunk servers with below-average disk space utilization. Over time
this will equalize disk utilization across chunk servers.
2. Limit the number of “recent” creations on each chunk server.
3. Spread replicas of a chunk across racks.
e The master re-replicates a chunk as soon as the number of available replicas falls below a
user-specified goal.
e The master picks the highest priority chunk and “clones” it by instructing some chunk server to
copy the chunk data directly from an existing valid replica.
e The new replica is placed with goals similar to those for creation: equalizing disk space
utilization, limiting active clone operations on any single chunk server, and spreading replicas
across racks.
e The master rebalances replicas periodically: it examines the current replica distribution and
moves replicas for better disk space and load balancing.
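The three placement factors can be combined into a simple selection routine. This is a sketch only: the function name `place_replicas`, the dictionary fields and the `max_recent` threshold are assumptions, and the text does not say how the master actually weighs the factors.

```python
def place_replicas(servers, count=3, max_recent=2):
    """Pick chunk servers for the replicas of a new chunk.

    servers: list of dicts with the keys "name", "rack", "utilization" (0..1)
    and "recent_creations".
    """
    # Factor 2: skip servers that already have many recent creations.
    # Factor 1: prefer servers with low disk space utilization.
    candidates = sorted(
        (s for s in servers if s["recent_creations"] < max_recent),
        key=lambda s: s["utilization"],
    )
    chosen, racks = [], set()
    # Factor 3: first try to put every replica on a different rack.
    for server in candidates:
        if len(chosen) < count and server["rack"] not in racks:
            chosen.append(server)
            racks.add(server["rack"])
    # If there are fewer racks than replicas, fill up with the remaining servers.
    for server in candidates:
        if len(chosen) < count and server not in chosen:
            chosen.append(server)
    return [server["name"] for server in chosen]
```

The same routine could serve re-replication by passing `count=1` and leaving out the servers that already hold a replica.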
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